– Data Analytics
Data Lakehouse becomes a strategic analytics tool
Data analysis for the executive suite
Using business analytics to leverage the value of data
“Insurers in particular, who proudly employ many mathematicians, are increasingly creating added value from data – with the help of machine learning and statistics,” explains Dr. Sarah Detzler, Competence Lead Data Science and Machine Learning at SAP. However, every company must find its own way to make beneficial AI use cases visible, to involve customers and employees in the data strategy, and to build up expertise and data science infrastructures. Above all, companies should enable management boards to make strategically relevant decisions on this basis. One key question is what contribution business analytics will make in the future to evaluating long-term data and corporate strategies.
Data, knowledge, recommendations for action
The necessary business interrelationships have long been mapped out in multidimensional data models: evaluation criteria are formalized as suitable key figures and – in the classic construct – transferred into a closed solution, the data mart of a data warehouse.
However, the classic data warehouse – with its built-in domain knowledge – quickly reaches its limits. The operational effort required to create and evolve its transformation logic is enormous. Strategically, the tight coupling of the domain logic to the dimensional model limits the analytical breadth of the approach: problems that go beyond the familiar business context can hardly be analyzed meaningfully within this framework. Questions about future scenarios – for example, when marketing targets new customer segments or an expanded product range – can only be addressed inadequately. Large volumes of analytical data can be stored neither efficiently nor cost-effectively, and the results of new analytical methods cannot be integrated successfully.
– OPEN-SOURCE STANDARDS
Open source plays an important role in data analytics
Open-source standards are also crucial for success when working with data platforms and in data science. After the initial disillusionment with the Hadoop framework, whose complexity was widely underestimated, a large number of cloud-based platform-as-a-service infrastructures built on the Spark framework are now emerging. In machine learning and artificial neural networks, the field is increasingly concentrating on a few open-source standards, most of which started as in-house developments at the major players and were later handed over to open-source organizations such as the Apache Software Foundation.
Model-based descriptions versus process evaluation of the data warehouse
With the new paradigm of data science – especially the method spectrum of artificial intelligence and machine learning – completely new tools for data-based knowledge processing are available. These are particularly promising where knowledge is incomplete or only available statistically. There is a lot of potential in a higher rate of straight-through processing – fully automated handling without human intervention – for example in the insurance lines of claims or applications. AI solutions such as telematics and data- or usage-based tariffs frequently emerge around new business models, and probability-based approaches such as Next Best Action give marketing, among others, a quick return on investment. Above all, model-based descriptions in a data science approach deviate considerably from the metrics-driven process evaluation of the classic data warehouse and its subject-oriented data marts.
The analytical breadth required for machine knowledge building is usually considerably larger: the necessary domain knowledge emerges during model building, together with the data basis used. The explorative procedure this requires, however, demands that different data sets can be made available easily and quickly.
The models developed with machine learning are thus closely coupled to the underlying data structures – as a so-called data product, a complex combination of data basis, machine learning methodology and analytical domain knowledge. This encapsulation as a data structure with its own life cycle is difficult to represent in a classic data warehouse, with its inherently relational storage, its focus on data quality and consistency, and its poor scalability; the remedy is ubiquitous data storage via a data lake.
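The idea of a data product – data basis, model and metadata bundled into one versioned unit with its own life cycle – can be sketched in a few lines of Python. The class and field names here are illustrative assumptions, not a standard:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DataProduct:
    """Illustrative bundle of data basis, ML model and domain metadata."""
    name: str
    version: int
    data_source: str              # e.g. a path or table in the data lake
    model: object                 # trained estimator, framework-agnostic
    feature_columns: list = field(default_factory=list)
    trained_on: date = field(default_factory=date.today)

    def next_version(self, retrained_model) -> "DataProduct":
        # A retrain produces a new, immutable version of the product,
        # leaving the previous version available for comparison or rollback.
        return DataProduct(self.name, self.version + 1, self.data_source,
                           retrained_model, list(self.feature_columns))
```

A retrain then yields `product.next_version(new_model)` rather than mutating the deployed version in place – the life-cycle management the article describes.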
Dr. Sarah Detzler, Competence Lead Data Science and Machine Learning at SAP:
“You have to question at what rate new data is coming in and when certain processes are changing. On this basis, you can re-train the model and adapt it to the new data situation.”
Today, the data lake is offered by all cloud providers as an independent infrastructure – an approach derived from the Hadoop framework. It enables open, scalable and cost-effective data storage and supports the use of machine learning through the preventive storage of as much available data as possible. With a schema-on-read architecture suited to this iterative mode of operation, the data is initially loaded into the data platform without any prior transformation.
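Schema-on-read can be illustrated in plain Python: raw JSON events are stored untouched, and a schema – here a simple name-to-converter mapping, an assumption made for this sketch – is applied only when the data is read:

```python
import json
import io

# Raw events land in the lake untransformed. Schema-on-write would reshape
# or reject them on ingest; schema-on-read defers that work to query time.
raw = io.StringIO(
    '{"policy": "A-1", "claim": "120.5"}\n'
    '{"policy": "A-2", "claim": "80", "region": "south"}\n'
)

# A read-time schema: field name -> converter; absent fields become None.
schema = {"policy": str, "claim": float, "region": str}

def read_with_schema(lines, schema):
    for line in lines:
        rec = json.loads(line)
        yield {name: (conv(rec[name]) if name in rec else None)
               for name, conv in schema.items()}

rows = list(read_with_schema(raw, schema))
# rows[0]["claim"] == 120.5; rows[0]["region"] is None
```

Because the schema lives with the query rather than the storage, two teams can read the same raw files with different schemas – the flexibility that makes the explorative data science workflow possible.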
All AI results should be API-enabled
This is how machine learning platforms succeed in practice – as at Zurich Group Germany. The insurer has built a state-of-the-art, cloud-based AI landscape on a hyperscaler: “We spin up the platform and build suitable Git repositories for data management and for the MLOps of our models. In addition, each AI application is encapsulated in a function or container. This allows us to keep up to three different versions in parallel. We have also established clear naming conventions and defined service levels with our Delivery Center in Barcelona, which monitors, for example, whether our containers and applications are alive,” explains Dr. Michael Zimmer, Chief Data Officer of Zurich Group Germany. “Our authorization concepts are data protection compliant. All consuming systems are supplied via interfaces; all AI results are therefore API-enabled. We keep the data in a data lake.” When it comes to data content and data structure, however, the data lake remains more or less imprecise. If the data is to be used in a business context, a structure common to all data is required.
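What “API-enabled” can mean in practice is sketched below in plain Python: a JSON request goes in, a versioned JSON result comes out, and several model versions stay addressable in parallel. The endpoint shape, version names and stand-in models are illustrative assumptions, not Zurich’s actual implementation:

```python
import json

# Up to three model versions kept in parallel; the lambdas are
# illustrative stand-ins for real trained models.
MODEL_VERSIONS = {
    "v1": lambda features: 0.30,
    "v2": lambda features: 0.42,
    "v3": lambda features: 0.55,
}

def score_endpoint(body: str, version: str = "v3") -> str:
    """Minimal sketch of an API-enabled AI result: JSON in, JSON out."""
    features = json.loads(body)
    score = MODEL_VERSIONS[version](features)
    # The response names the model version, so consuming systems can
    # pin a version or compare versions side by side.
    return json.dumps({"model_version": version, "score": score})
```

A consuming system would call, for example, `score_endpoint('{"claim_amount": 1200}', "v2")` and receive a self-describing JSON result instead of a bare number.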
Philipp Schützbach, Sales Engineer at Dataiku:
“Anyone who puts a machine learning model into production needs to actively manage the lifecycle of the model – that is, they need to know when and how to retrain a model, and they should understand why the model behaves the way it does.”
Data Lake with central access layer
This structure is created and managed using the Delta Lake approach – a conceptual extension of the data lake. The Delta Lake contains a central, metadata-driven access layer and thus combines familiar methods and tools of data preparation with new technologies: SQL, the data engineer’s preferred tool for implementing process logic, is supported, as is a data frame interface, the preferred data structure in a data science context.
A common semantic data view with different representations reduces redundancy in the process logic and facilitates the integration of domain know-how. Current Delta Lake developments promise not only high-performance access via SQL but also transaction and integrity guarantees within the framework of the ACID model.
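Delta Lake provides these guarantees on top of object storage; the ACID semantics themselves – a failed write leaves the data untouched – can be illustrated with Python’s built-in sqlite3, used here purely as a stand-in for the transactional layer:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE claims (id INTEGER PRIMARY KEY, amount REAL)")
con.execute("INSERT INTO claims VALUES (1, 100.0)")
con.commit()

try:
    with con:  # one transaction: all-or-nothing
        con.execute("UPDATE claims SET amount = amount + 50 WHERE id = 1")
        raise RuntimeError("simulated failure mid-transaction")
except RuntimeError:
    pass  # the context manager rolled the transaction back

# Readers never see the half-finished update; the row is unchanged.
amount = con.execute("SELECT amount FROM claims WHERE id = 1").fetchone()[0]
```

The same atomicity is what a Delta-style transaction log adds to a data lake: concurrent readers see either the state before a write or after it, never a partial result.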
The future belongs to open transaction platforms
The Delta Lake makes it possible to think beyond ubiquitous analytical platforms toward a Data Lakehouse: the concept of a cross-organizational view of the data inventory without technological restrictions. Organizations, including management, can refocus their attention on functional data integration in a common information model. Participants see the data in their preferred form of representation, while cross-cutting concerns such as data governance, data quality or a single point of truth are provided and maintained centrally where they are necessary.

This is exactly what matters in the fast-moving VUCA world: “Circumstances change, new data is added, and a model can suddenly perform a few percentage points worse than it did at the beginning. You have to question the speed at which new data flows in and when certain processes change. On this basis, you can re-train the model and adapt it to the new data situation,” explains Detzler. “A machine learning model isn’t a no-brainer,” affirms Philipp Schützbach, Sales Engineer at AI vendor Dataiku. “Anyone who sets one up productively must, on the one hand, actively manage the life cycle of the model – in other words, know when and how a model needs to be retrained. On the other hand, they should also understand why the model behaves the way it does.”
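Detzler’s observation – a model quietly losing a few percentage points – can be turned into a simple monitoring rule that flags when a retrain is due. The threshold and the accuracy figures below are illustrative assumptions:

```python
def should_retrain(baseline_acc, recent_acc, tolerance=0.03):
    """Flag a retrain when recent accuracy falls more than a few
    percentage points below the accuracy measured at deployment."""
    return (baseline_acc - recent_acc) > tolerance

# Illustrative weekly accuracies of a deployed model.
history = [0.91, 0.90, 0.89, 0.86]
baseline = history[0]

flags = [should_retrain(baseline, acc) for acc in history[1:]]
# Only the drop to 0.86 exceeds the 3-point tolerance and triggers a retrain.
```

Real MLOps setups replace the hand-picked threshold with statistical drift tests on inputs and outputs, but the life-cycle decision – measure, compare against the baseline, retrain – has this shape.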
Dr. Michael Zimmer, Chief Data Officer at Zurich Group Germany:
“We spin up the AI platform and build matching Git repositories for data management and the MLOps of our models. In addition, each AI application is encapsulated in a function or container. This allows us to keep up to three different versions in parallel.”
The article “Delta Lake, Data Lake and Co.: The future belongs to open transaction platforms” was published as a guest contribution in the editorial section of “Sapport Magazin”, a German-language publication.