A data lakehouse is a modern data architecture that incorporates the features of both data lakes and data warehouses. This combination makes it ideal for a wide range of data analytics use cases and has made it popular among many organizations.
This article explains data lakehouses: how they emerged, how they compare with data lakes and data warehouses, their architecture, and finally, the pros and cons of using one.
A data lakehouse is a data management solution that combines the best features of a data lake and a data warehouse in a single, unified platform. It addresses the limitations that data lakes and data warehouses each have when used separately.
Because it has the capabilities of both a data lake and a data warehouse, a data lakehouse can support several kinds of projects, including business intelligence (BI), data science, machine learning (ML), AI and SQL analytics. It inherits these capabilities from data lakes and data warehouses, as the comparison table later in this article summarizes.
Data warehouses emerged in the 1980s as solutions for storing and managing structured data from various sources. They were primarily designed to support data analytics and BI with efficient querying capabilities. Yet, data warehouses couldn't support rapidly evolving unstructured and semi-structured data like pictures, videos or audio recordings, and they required time-consuming and expensive data cleaning and transformation to accommodate such data types.
Data lakes emerged in the early 2010s to address the limitations of data warehouses. They provided a low-cost, scalable option for analytics across various data types and formats. Nonetheless, data lakes had limitations of their own, such as poor data governance, a lack of support for ACID transactions and weak query performance.
As a result, organizations often maintained both systems simultaneously, linking them together to work around each one's limitations. This often led to issues such as data duplication, high maintenance costs and security challenges.
Data lakehouses emerged to address those challenges by combining the best features of data lakes and data warehouses. The best part is clear: you no longer have to maintain multiple systems for multiple workloads.
The following table summarizes how data warehouses and data lakes compare with data lakehouses. It gives you an idea of how data lakehouses combine their features to form a unified solution.
| Feature | Data Warehouse | Data Lake | Data Lakehouse |
| --- | --- | --- | --- |
| Supported data types | Structured data | Structured, semi-structured, unstructured (raw) and textual data | Structured, semi-structured, unstructured (raw) and textual data |
| Data formats | Closed, proprietary formats | Open formats such as Parquet, ORC and Avro | Open, standardized formats such as Parquet, ORC and Avro |
| Data governance | Simple and well-defined | Poor governance capabilities | Complex but well-defined |
| Data access | SQL only; direct file access is not supported | Direct file access via open APIs | Direct file access via open APIs, with no vendor lock-in |
| Cost | High, due to proprietary technologies | Low, due to open source technologies | Low-cost data storage |
| Scalability | Limited, and scaling is expensive | High and cost-effective | High |
| Performance | High | Low | High |
| Use cases | Business intelligence and reporting | Data analytics, data science, ML and AI | All data warehouse and data lake use cases |
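To make the "Data formats" and "Data access" rows concrete, here is a minimal sketch, using the open source DuckDB engine and a hypothetical events.parquet file, of how an open columnar file in lakehouse storage can be reached both through SQL and through direct file access:

```python
import duckdb
import pyarrow.parquet as pq

# SQL access: query the open-format file in place, with no load step
result = duckdb.sql("SELECT event, count(*) FROM 'events.parquet' GROUP BY event")
print(result.fetchall())

# Direct file access: the same bytes are readable by any Parquet library,
# so there is no vendor lock-in on the storage format
table = pq.read_table("events.parquet")
print(table.schema)
```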
A data lakehouse architecture consists of five layers.
The data ingestion layer is the bottom layer of a data lakehouse. It is responsible for collecting data from a variety of sources and delivering it to the storage layer.
Data sources include relational and NoSQL databases, social media platforms, websites and other organization-specific applications that generate data. The ingestion layer also supports data streaming for real-time processing of data from streaming sources like IoT sensors.
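As an illustration, here is a minimal sketch of batch ingestion landing one raw record in object storage. The bucket, key layout and record fields are hypothetical, and a production pipeline would more likely use a managed connector or a streaming framework:

```python
import json
from datetime import datetime, timezone

import boto3  # AWS SDK for Python

s3 = boto3.client("s3")

# A raw event from a hypothetical IoT sensor, kept exactly as received
record = {
    "sensor_id": "s-42",
    "temperature_c": 21.7,
    "read_at": datetime.now(timezone.utc).isoformat(),
}

# Land the untransformed record in the lakehouse's raw zone
s3.put_object(
    Bucket="my-lakehouse-raw",  # hypothetical bucket name
    Key=f"iot/{record['sensor_id']}/{record['read_at']}.json",
    Body=json.dumps(record).encode("utf-8"),
)
```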
The second layer of a typical data lakehouse is the storage layer. It consists of low-cost storage solutions such as AWS S3 and HDFS. Data can be stored in its raw form without any transformation, allowing client tools to access it directly. Components in the consumption layer and different APIs can likewise access the same data.
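For example, here is a minimal sketch of writing and reading data in the storage layer using the open Parquet format; a local file stands in for an object store, and the table contents are made up:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Raw clickstream data written as-is, in an open columnar format
events = pa.table({
    "user_id": [101, 102, 101],
    "event": ["view", "click", "purchase"],
})
pq.write_table(events, "events.parquet")

# Any engine or client tool can read the same file directly
print(pq.read_table("events.parquet").to_pydict())
```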
The third layer is the metadata layer, which stores metadata, meaning information about every data object in the storage layer. It also provides data management features like ACID transactions, caching, indexing, data versioning and cloning.
The metadata layer can be seen as a unified catalog of metadata. This layer enables data governance, auditing and schema management functionalities.
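As a sketch of what the metadata layer adds on top of plain files, the snippet below uses the open source deltalake package (one of several table formats a lakehouse might adopt; the path and data are made up) to get transactional writes, an audit history and time travel:

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({"order_id": [1, 2], "amount": [9.99, 24.50]})

write_deltalake("./orders", df)                 # ACID write: creates version 0
write_deltalake("./orders", df, mode="append")  # second transaction: version 1

dt = DeltaTable("./orders")
print(dt.history())  # audit log of every transaction, useful for governance

# Time travel: read the table exactly as it existed at version 0
print(DeltaTable("./orders", version=0).to_pandas())
```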
The API layer hosts different types of APIs for data analytics and related data processing activities. It allows machine learning libraries like TensorFlow and MLlib to read data directly from the lakehouse, DataFrame APIs enable query optimizations, and metadata APIs help consumers discover and understand the data they need.
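For instance, here is a minimal sketch of a DataFrame-style API pushing work down to storage; it uses pyarrow's dataset API and assumes the events.parquet file from the earlier storage-layer example:

```python
import pyarrow.dataset as ds

dataset = ds.dataset("events.parquet", format="parquet")

# Column pruning and filter pushdown: only the needed column and the
# matching rows are read from storage, not the whole file
clicks = dataset.to_table(
    columns=["user_id"],
    filter=ds.field("event") == "click",
)
print(clicks.to_pydict())
```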
At the top of the layered architecture sits the data consumption layer, which consumes data from the storage layer and accesses the metadata. This layer hosts tools for data science, ML and BI, such as Power BI and Tableau, which organizations use to create and run various analytics jobs.
Today, you can expect many benefits from using a data lakehouse. Chief among them: finally harnessing the power of all your data sources.
Data lakehouses offer a management interface for easily controlling access to data storage and managing compliance and data quality. They provide fine-grained access control that can be applied to rows, columns and views, as well as attribute-based access control. They also let users set constraints on data quality, data versioning and data monitoring, much as a database administrator would.
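As a simplified illustration of column- and row-level control, the sketch below (using DuckDB and made-up data; real lakehouses enforce this through their governance layer) narrows access by exposing a restricted view instead of the underlying table:

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE hr (name TEXT, dept TEXT, salary INTEGER)")
con.execute("INSERT INTO hr VALUES ('Ana', 'eng', 120000), ('Bo', 'ops', 90000)")

# Hide the salary column and non-engineering rows behind a view
con.execute("""
    CREATE VIEW hr_restricted AS
    SELECT name, dept FROM hr WHERE dept = 'eng'
""")

print(con.execute("SELECT * FROM hr_restricted").fetchall())  # [('Ana', 'eng')]
```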
A data lakehouse consists of a single, unified data store that can accommodate various data types and cater to various use cases. This single storage solution helps reduce data duplication, which becomes an issue when data is stored separately across several systems.
Data lakehouses use low-cost storage solutions and reduce the costs of handling multiple databases. They are built on inexpensive and flexible storage technologies such as Amazon S3, Azure Blob Storage, Google Cloud Storage and HDFS.
In addition, data lakehouses reduce maintenance costs, as they do not require complex ETL processes to prepare data for analytics and machine learning workloads.
The combination of data lakes and data warehouses enables organizations to run multiple workloads. Several users — data developers, business analysts, data scientists — can use the analytics tool of their choice.
Data lakehouses provide direct access to some of the most widely used business intelligence tools, such as Tableau and Power BI. They also support open data formats and machine learning libraries, including Python and R libraries, allowing machine learning engineers and data scientists to fully leverage the power of big data.
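For example, a data scientist might train a model straight from open-format files in lakehouse storage. This sketch assumes a hypothetical churn.parquet file with tenure, monthly_spend and churned columns:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Read features directly from an open-format file in lakehouse storage
df = pd.read_parquet("churn.parquet")  # hypothetical file

X = df[["tenure", "monthly_spend"]]
y = df["churned"]

# Train a simple churn classifier with scikit-learn
model = LogisticRegression().fit(X, y)
print(model.predict(X.head()))
```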
Because they support the integration of multiple data sources and workloads, data lakehouses let R&D teams research and test innovative solutions to customer issues. Teams can focus on innovation rather than on the time-consuming data transformation and pre-processing needed to make data analytics-ready.
One main challenge of adopting a data lakehouse is its steep learning curve. An organization needs to learn and become familiar with new technologies and tools to transition fully to a data lakehouse, which may involve extra time, effort and cost to train staff on how to use and operate it.
Another challenge associated with data lakehouses is storing raw data: without proper security and cataloging mechanisms, that data can become unusable and cannot be fully trusted.
The data lakehouse emerged as a solution to the limitations of data warehouses and data lakes. It is a low-cost solution that can store data in various formats and support a range of data analytics workloads. It offers centralized, unified data storage that is flexible and efficient, along with strong data governance and security capabilities.
A data lakehouse architecture has five layers: data ingestion, storage, metadata, API and data consumption. As discussed in this article, data lakehouses can benefit organizations in different ways. However, there are also limitations, such as the steep learning curve, the complexity of migrating to the solution and the challenges of storing raw data. Weigh these trade-offs when considering a data lakehouse for your organization.