What Is a Data Lakehouse?
Two years ago, when a colleague used the terms interchangeably, we wrote about data lakes vs. data warehouses. Technology patterns advance, morph, and blur over time, and today we're seeing data lakehouses, a relatively new concept that blends the two models.
What Is a Data Warehouse?
Amazon defines a data warehouse as a central repository of information that can be analyzed to make more informed decisions. Typically, a data warehouse aggregates data from disparate enterprise sources for reporting. Building one requires mapping, cleaning, extracting, transforming, and loading varied data, which is then refreshed at regular intervals. Two industry methodologies drive design. First, Ralph Kimball advocates a dimensional data warehouse that organizes all data into facts and dimensions for reporting at a particular point in time. Second, Bill Inmon's methodology builds an enterprise data warehouse first, then derives departmental data marts from it. Regardless of the structure, the warehouse enables reporting, business intelligence, and data analysis that summarize operational data.
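To make the Kimball approach concrete, here is a minimal sketch of a dimensional model: one fact table joined to a product dimension and a date dimension. The table names, columns, and data are illustrative, not from any particular warehouse.

```python
import sqlite3

# A tiny Kimball-style star schema: fact_sales references two dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    amount REAL
);
INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO dim_date VALUES (20240101, 2024, 1), (20240201, 2024, 2);
INSERT INTO fact_sales VALUES (1, 20240101, 9.99), (1, 20240201, 19.98), (2, 20240101, 5.00);
""")

# Point-in-time reporting: total sales per product per month.
rows = conn.execute("""
    SELECT p.name, d.year, d.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY p.name, d.year, d.month
    ORDER BY p.name, d.year, d.month
""").fetchall()
```

The dimensions carry the descriptive attributes; the fact table stays narrow and numeric, which is what makes slicing by any dimension cheap.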
What Is a Data Lake?
Google defines a data lake as a centralized repository to store, process, and secure large amounts of structured, semi-structured, and unstructured data. Where a data warehouse is always structured (requiring more effort to build but little to report from), a data lake is more free-form (requiring little effort to build but more to report from). Lakes are more flexible, scalable, and versatile but often less accurate. They simplify ingesting and storing data while making it faster to perform batch, streaming, and interactive analytics.
What Is a Data Lakehouse?
Today, organizations with larger datasets have another option for storage architecture: a hybrid called a “data lakehouse”. It can house both structured data (like a data warehouse) and unstructured data (like a data lake). We identify and extract features of the data into a structure, allowing it to be organized more like a warehouse. That is, it combines the flexible storage of unstructured data in a data lake with the tools and management features (e.g., data cleansing, ETL, and schema enforcement) of a data warehouse.
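Schema enforcement is the most tangible of those warehouse-style features: records are validated against a declared schema before they land in storage, instead of being interpreted only at read time. Here is a minimal sketch of that idea in plain Python; the schema format and function name are hypothetical, not from any lakehouse product.

```python
# Illustrative schema-on-write check: a declared schema maps field names
# to the Python types we expect for each field.
SCHEMA = {"device_id": str, "temperature": float}

def enforce_schema(record: dict) -> dict:
    """Reject a record whose fields or types don't match the declared schema."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected or missing fields: {set(record) ^ set(SCHEMA)}")
    for field, expected in SCHEMA.items():
        if not isinstance(record[field], expected):
            raise ValueError(f"{field} must be {expected.__name__}")
    return record

# A conforming record passes through unchanged; a malformed one raises.
clean = enforce_schema({"device_id": "s-1", "temperature": 21.5})
```

Real lakehouse table formats apply the same gate at the file level, refusing writes whose columns or types diverge from the table's schema.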
Data Lakehouse Benefits
Like a data warehouse, a data lakehouse maintains the atomicity, consistency, isolation, and durability (ACID) of data. You can use off-the-shelf BI tools, as all data resides in one platform. Costs drop because data isn't kept in several storage systems simultaneously; the lakehouse eliminates the redundancy of running a data lake alongside several data warehouses. It also addresses data stagnation, where data sits unused in a lake because it is hard to find and query. And despite housing large volumes of diverse data, lakehouses still facilitate advanced analytics, reporting, and machine learning.
Most organizations don’t need a lakehouse — they would be better off with a data warehouse. Lakehouses are relatively new, so there’s not really an out-of-the-box approach to ensure success, which means it can take time to set up. It can also be costly to maintain if you’re figuring it out as you go along.
How a Data Lakehouse Works
A data lakehouse allows you to provide different user experiences for different needs and users. For example, a CFO looking at supply chain data and a product engineer looking for sensor defects need different tooling and different levels of detail in the data. As a result, lakehouses can be complex to build from scratch. To address these diverse needs, engineers build distinct layers into the lakehouse.
A data lakehouse system comprises five layers:
The ingestion layer pulls data from a variety of sources (e.g., relational databases, NoSQL databases, SaaS, CRMs, IoT sensors) and moves it to the storage layer.
The storage layer keeps all kinds of data in low-cost object stores like Amazon S3, where tools can read objects directly.
The metadata layer provides a unified, structured catalog for all objects in the lake, including data governance and auditing functionality.
The API layer enables end users to run processing tasks and access advanced analytics. Metadata APIs describe the data elements required to access and retrieve application data.
The consumption layer hosts tools and applications to access data and metadata stored in a lake. You’ll find a vast array of tooling options, with the most popular including Power BI, Tableau, Apache Drill, Amazon Athena, Snowflake, Databricks, Azure Synapse Analytics, and Infor Data Lake.
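The five layers above can be sketched end to end with in-memory stand-ins: a dict for the object store, another dict for the metadata catalog. Everything here is hypothetical and deliberately tiny; real lakehouses use object storage, open table formats, and query engines in place of these stand-ins.

```python
import json

object_store = {}  # storage layer: object key -> raw bytes, like S3 objects
catalog = {}       # metadata layer: table name -> object keys + schema

def ingest(table: str, records: list[dict]) -> None:
    """Ingestion layer: land raw records as an object, register it in the catalog."""
    existing = catalog.get(table, {}).get("objects", [])
    key = f"{table}/part-{len(existing)}.json"
    object_store[key] = json.dumps(records).encode()
    entry = catalog.setdefault(table, {"objects": [], "schema": sorted(records[0])})
    entry["objects"].append(key)

def query(table: str, predicate) -> list[dict]:
    """API layer: use the catalog to locate objects, then read them directly from storage."""
    out = []
    for key in catalog[table]["objects"]:
        out.extend(r for r in json.loads(object_store[key]) if predicate(r))
    return out

# Consumption layer: a BI tool or notebook would sit here, issuing queries.
ingest("sensor_readings", [{"sensor": "a", "temp": 20.1}, {"sensor": "b", "temp": 35.7}])
hot = query("sensor_readings", lambda r: r["temp"] > 30)
```

The point of the sketch is the separation of concerns: storage holds cheap, dumb objects; the catalog is the single structured source of truth about what exists; and readers never scan storage blindly, they go through the catalog.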