What it is, key features, and benefits. This guide defines and compares data lakehouses, data lakes and data warehouses to help you evaluate which structure is best for your organization.
A data lakehouse is a data management architecture that combines key capabilities of data lakes and data warehouses. It pairs the benefits of a data lake, such as low storage cost and broad data access, with the benefits of a data warehouse, such as data structures and management features.
The lakehouse structure ensures that data analysts and data scientists can apply the full and most recent data set toward business intelligence, data analytics and machine learning. And having one system to manage simplifies the enterprise data infrastructure and allows analysts and scientists to work more efficiently.
Here we present the key features of data lakehouses and the benefits they bring to your organization.
FEATURE | BENEFIT |
---|---|
Concurrent read & write transactions: Data lakehouses can handle multiple data pipelines. | Multiple users can concurrently read and write ACID-compliant transactions without compromising data integrity. |
Data warehouse schema architectures: Data lakehouses can apply a schema to all data. | This means lakehouses can standardize large datasets. |
Governance mechanisms: Lakehouses can support strong governance and auditing capabilities. | Having a single control point lets you better control publishing, sharing and user access to data. |
Open & standardized storage formats: Lakehouses use open, standardized storage formats such as AVRO, ORC or Parquet, and also support tabular formats. | Open formats facilitate broad, flexible and efficient data consumption from BI tools to programming languages such as Python and R. Many also support SQL (see the short Parquet sketch after this table). |
Separation of storage & processing: Like some modern data warehouses, lakehouses decouple storage and compute resources by using separate clusters for storing and processing. | You can scale to larger datasets and serve more concurrent users. Plus, these clusters run on inexpensive hardware, which saves you money. |
Support for diverse data types: Data lakehouses give you access to structured, semi-structured and unstructured data types. | This allows you to store, access, refine and analyze a broad range of data types and applications, such as IoT data, text, images, audio, video, system logs and relational data. |
Support for end-to-end streaming: Data lakehouses support data streaming. | This enables real-time reporting and analysis. Plus, you no longer need separate systems dedicated to serving real-time data apps. |
Single repository for many applications: Lakehouses allow you to use business intelligence tools, run machine learning projects, perform data science, and execute SQL and analytics directly on a single repository of clean, integrated source data. | This improves operational efficiency and data quality for BI, ML and other workloads since you only have to maintain one data repository. |
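To make the open-formats row above concrete, here is a minimal sketch of writing and reading a Parquet file with pyarrow. The file path and column names are illustrative only and not taken from any specific product.

```python
# Minimal sketch: writing and reading an open-format Parquet file with pyarrow.
# Path and column names are hypothetical examples.
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small in-memory table (a stand-in for data landing in the lakehouse).
table = pa.table({
    "device_id": [101, 102, 103],
    "reading": [21.4, 19.8, 22.1],
})

# Persist it as Parquet, an open, columnar storage format.
pq.write_table(table, "readings.parquet")

# Any Parquet-aware engine (Spark, Trino, pandas, R's arrow package) can read it back.
restored = pq.read_table("readings.parquet")
print(restored.to_pandas())
```

Because the format is open, the same file can be consumed by BI tools, SQL engines and programming languages without conversion or vendor lock-in.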
Historically, you’ve had two primary options for a data repository: data lake or data warehouse. To support BI, data science and machine learning, it’s likely that you’ve had to maintain both of these structures simultaneously and link the systems together. This often leads to data duplication, security challenges and additional infrastructure expense. Data lakehouses can overcome these issues.
Data warehouses hold highly structured and unified data to support specific business intelligence and analytics needs. The data has been transformed and fit into a defined schema.
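As a rough illustration of that schema-on-write approach, the sketch below uses SQLite as a stand-in for a warehouse: the table structure is declared first, and only rows that already conform to it are loaded. The table and column names are hypothetical.

```python
# Minimal schema-on-write sketch: the schema is fixed before any data is loaded.
# SQLite stands in for a real data warehouse; names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")

# The table structure is defined up front.
conn.execute("""
    CREATE TABLE sales (
        order_id   INTEGER PRIMARY KEY,
        region     TEXT NOT NULL,
        amount_usd REAL NOT NULL
    )
""")

# Data is transformed to fit this schema before it is inserted.
conn.execute("INSERT INTO sales VALUES (?, ?, ?)", (1, "EMEA", 250.0))
conn.commit()

print(conn.execute(
    "SELECT region, SUM(amount_usd) FROM sales GROUP BY region"
).fetchall())
```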
Data lakes hold raw, source data in a wide variety of formats to directly support data science and machine learning. Massive volumes of structured and unstructured data, like ERP transactions and call logs, can be stored cost-effectively. Data teams can build data pipelines and schema-on-read transformations to make data stored in a data lake available for BI and analytics tools.
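As a hedged sketch of schema-on-read, the PySpark snippet below applies a declared schema to raw JSON call logs at read time, leaving the files in the lake untouched. The bucket path, field names and types are assumptions made for illustration.

```python
# Minimal schema-on-read sketch: raw JSON stays as-is in the lake; a schema is
# applied only when the data is read. Path and fields are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("schema-on-read-example").getOrCreate()

# Declare the structure we expect at read time, not at write time.
call_log_schema = StructType([
    StructField("caller_id", StringType()),
    StructField("duration_minutes", DoubleType()),
    StructField("started_at", TimestampType()),
])

# Apply the schema while reading raw files from the lake.
calls = spark.read.schema(call_log_schema).json("s3://my-data-lake/raw/call_logs/")

# The typed DataFrame can now feed BI or analytics downstream.
calls.groupBy("caller_id").sum("duration_minutes").show()
```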
The data lakehouse can be more flexible than the traditional data warehouse or data lake in that it can eliminate data redundancy and improve data quality while offering lower-cost storage. ETL pipelines provide the critical link between the unsorted lake layer and the integrated warehouse layer. A growing ecosystem of providers, such as Databricks, Google BigQuery, Azure Synapse, Snowflake and Amazon Redshift, offers solutions that allow the data to be stored cost-effectively.
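Below is a minimal sketch of one such ETL step, assuming a Parquet-based raw lake layer and a curated warehouse layer; the paths, column names and filter condition are hypothetical.

```python
# Minimal ETL sketch linking a raw "lake" layer to a curated "warehouse" layer.
# Directory layout, columns and filter are illustrative assumptions.
import pandas as pd

# Extract: read raw transactions from the unsorted lake layer.
raw = pd.read_parquet("lake/raw/erp_transactions.parquet")

# Transform: de-duplicate, enforce types, and keep only completed orders.
curated = (
    raw.drop_duplicates(subset="order_id")
       .astype({"amount": "float64"})
       .query("status == 'completed'")
)

# Load: write the cleaned result to the integrated warehouse layer.
curated.to_parquet("warehouse/curated/erp_transactions.parquet", index=False)
```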
Modern data integration delivers real-time, analytics-ready and actionable data to any analytics environment, from Qlik to Tableau, Power BI and beyond.