A data lakehouse is a modern data architecture that incorporates the features of both data lakes and data warehouses. This combination makes it ideal for a wide range of data analytics use cases and has made it popular among many organizations.
This article explains data lakehouses: how they emerged, how they compare with data lakes and data warehouses, their architecture, and finally, the pros and cons of using one.
A data lakehouse is a data management solution that combines the best features of a data lake and a data warehouse in a single, unified platform. It addresses the limitations that data lakes and data warehouses each have when used separately.
Because it has the capabilities of both a data lake and a data warehouse, a data lakehouse can support several kinds of projects, including business intelligence (BI), data science, machine learning (ML), AI and SQL analytics. It inherits these capabilities from data lakes and data warehouses, as the comparison table later in this article summarizes.
Data warehouses emerged in the 1980s as solutions for storing and managing structured data from various sources. They were primarily designed to support data analytics and BI with efficient querying capabilities. Yet, data warehouses couldn't support rapidly evolving unstructured and semi-structured data like pictures, videos or audio recordings, and they required time-consuming and expensive data cleaning and transformation to accommodate such data types.
Data lakes emerged in the early 2010s to address the limitations of data warehouses. They provided a low-cost, scalable option for analytics across various data types and formats. Nonetheless, data lakes had limitations of their own, such as poor data governance, a lack of support for ACID transactions and weak query performance.
As a result, organizations often maintained both systems simultaneously, linking them together to work around each one's limitations. This often led to issues such as data duplication, high maintenance costs and security challenges.
Data lakehouses emerged to address those challenges by combining the best features of data lakes and data warehouses. The best part is clear: you no longer have to maintain multiple systems for multiple workloads.
The following table summarizes how data warehouses and data lakes compare with data lakehouses. It gives you an idea of how data lakehouses combine their features to form a unified solution.
| Feature | Data Warehouse | Data Lake | Data Lakehouse |
| --- | --- | --- | --- |
| Supported data types | Structured data | Structured, semi-structured, unstructured (raw) and textual data | Structured, semi-structured, unstructured (raw) and textual data |
| Data formats | Closed, proprietary formats | Open formats such as Parquet, ORC and Avro | Open, standardized formats such as Parquet, ORC and Avro |
| Data governance | Simple and well-defined | Poor governance capabilities | Complex but well-defined |
| Data access | SQL only; direct file access is not supported | Direct file access via open APIs | Direct file access via open APIs, with no vendor lock-in |
| Cost | High, due to proprietary technologies | Low, due to open source technologies | Low-cost data storage |
| Scalability | Limited, and scaling is expensive | High and cost-effective | High |
| Performance | High | Low | High |
| Use cases | Business intelligence and reporting | Data analytics, data science, ML and AI | All data warehouse and data lake use cases |
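To make the "Data formats" and "Data access" rows concrete, here is a minimal sketch, using the open source DuckDB engine and a hypothetical events.parquet file, of how an open columnar file in lakehouse storage can be reached both through SQL and through direct file access:

```python
import duckdb
import pyarrow.parquet as pq

# SQL access: query the open-format file in place, with no load step
result = duckdb.sql("SELECT event, count(*) FROM 'events.parquet' GROUP BY event")
print(result.fetchall())

# Direct file access: the same bytes are readable by any Parquet library,
# so there is no vendor lock-in on the storage format
table = pq.read_table("events.parquet")
print(table.schema)
```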
A data lakehouse architecture consists of five layers.
The data ingestion layer is the bottom layer of a data lakehouse. It is responsible for collecting data from a variety of sources and delivering it to the storage layer.
Data sources include relational and NoSQL databases, social media platforms, websites and other organization-specific applications that generate data. The ingestion layer also supports data streaming for real-time processing of data from streaming sources like IoT sensors.
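As an illustration, here is a minimal sketch of batch ingestion landing one raw record in object storage. The bucket, key layout and record fields are hypothetical, and a production pipeline would more likely use a managed connector or a streaming framework:

```python
import json
from datetime import datetime, timezone

import boto3  # AWS SDK for Python

s3 = boto3.client("s3")

# A raw event from a hypothetical IoT sensor, kept exactly as received
record = {
    "sensor_id": "s-42",
    "temperature_c": 21.7,
    "read_at": datetime.now(timezone.utc).isoformat(),
}

# Land the untransformed record in the lakehouse's raw zone
s3.put_object(
    Bucket="my-lakehouse-raw",  # hypothetical bucket name
    Key=f"iot/{record['sensor_id']}/{record['read_at']}.json",
    Body=json.dumps(record).encode("utf-8"),
)
```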
The second layer of a typical data lakehouse is the storage layer. It consists of low-cost storage solutions such as AWS S3 and HDFS. Data can be stored in its raw form without any transformation, allowing client tools to access it directly. Components in the consumption layer and different APIs can likewise access the same data.
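For example, here is a minimal sketch of writing and reading data in the storage layer using the open Parquet format; a local file stands in for an object store, and the table contents are made up:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Raw clickstream data written as-is, in an open columnar format
events = pa.table({
    "user_id": [101, 102, 101],
    "event": ["view", "click", "purchase"],
})
pq.write_table(events, "events.parquet")

# Any engine or client tool can read the same file directly
print(pq.read_table("events.parquet").to_pydict())
```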
The third layer is the metadata layer, which stores metadata, meaning information about every data object in the storage layer. It also provides data management features like ACID transactions, caching, indexing, data versioning and cloning.
The metadata layer can be seen as a unified catalog of metadata. This layer enables data governance, auditing and schema management functionalities.
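As a sketch of what the metadata layer adds on top of plain files, the snippet below uses the open source deltalake package (one of several table formats a lakehouse might adopt; the path and data are made up) to get transactional writes, an audit history and time travel:

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({"order_id": [1, 2], "amount": [9.99, 24.50]})

write_deltalake("./orders", df)                 # ACID write: creates version 0
write_deltalake("./orders", df, mode="append")  # second transaction: version 1

dt = DeltaTable("./orders")
print(dt.history())  # audit log of every transaction, useful for governance

# Time travel: read the table exactly as it existed at version 0
print(DeltaTable("./orders", version=0).to_pandas())
```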
The API layer hosts different types of APIs for data analytics and related data processing activities. It allows machine learning libraries like TensorFlow and MLlib to read data directly from the lakehouse, DataFrame APIs enable query optimizations, and metadata APIs help consumers discover and understand the data they need.
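For instance, here is a minimal sketch of a DataFrame-style API pushing work down to storage; it uses pyarrow's dataset API and assumes the events.parquet file from the earlier storage-layer example:

```python
import pyarrow.dataset as ds

dataset = ds.dataset("events.parquet", format="parquet")

# Column pruning and filter pushdown: only the needed column and the
# matching rows are read from storage, not the whole file
clicks = dataset.to_table(
    columns=["user_id"],
    filter=ds.field("event") == "click",
)
print(clicks.to_pydict())
```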
At the top of the layered architecture sits the data consumption layer, which consumes data from the storage layer and accesses the metadata. This layer hosts tools for data science, ML and BI, such as Power BI and Tableau, which organizations use to create and run various analytics jobs.
Today, you can expect many benefits from using a data lakehouse. Chief among them: finally harnessing the power of all your data sources.
Data lakehouses offer a management interface for easily controlling access to data storage and managing compliance and data quality. They provide fine-grained access control that can be applied to rows, columns and views, as well as attribute-based access control. They also let users set constraints on data quality, data versioning and data monitoring, much as a database administrator would.
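As a simplified illustration of column- and row-level control, the sketch below (using DuckDB and made-up data; real lakehouses enforce this through their governance layer) narrows access by exposing a restricted view instead of the underlying table:

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE hr (name TEXT, dept TEXT, salary INTEGER)")
con.execute("INSERT INTO hr VALUES ('Ana', 'eng', 120000), ('Bo', 'ops', 90000)")

# Hide the salary column and non-engineering rows behind a view
con.execute("""
    CREATE VIEW hr_restricted AS
    SELECT name, dept FROM hr WHERE dept = 'eng'
""")

print(con.execute("SELECT * FROM hr_restricted").fetchall())  # [('Ana', 'eng')]
```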
A data lakehouse consists of a single, unified data store that can accommodate various data types and cater to various use cases. This single storage solution helps reduce data duplication, which becomes an issue when data is stored separately across several systems.
Data lakehouses use low-cost storage solutions and reduce the costs of handling multiple databases. They are built on inexpensive and flexible storage technologies such as Amazon S3, Azure Blob Storage, Google Cloud Storage and HDFS.
In addition, data lakehouses reduce maintenance costs, as they do not require complex ETL processes to prepare data for analytics and machine learning workloads.
The combination of data lakes and data warehouses enables organizations to run multiple workloads. Several users — data developers, business analysts, data scientists — can use the analytics tool of their choice.
Data lakehouses provide direct access to some of the most widely used business intelligence tools, such as Tableau and Power BI. They also support open data formats and machine learning libraries, including Python and R libraries, allowing machine learning engineers and data scientists to fully leverage the power of big data.
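For example, a data scientist might train a model straight from open-format files in lakehouse storage. This sketch assumes a hypothetical churn.parquet file with tenure, monthly_spend and churned columns:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Read features directly from an open-format file in lakehouse storage
df = pd.read_parquet("churn.parquet")  # hypothetical file

X = df[["tenure", "monthly_spend"]]
y = df["churned"]

# Train a simple churn classifier with scikit-learn
model = LogisticRegression().fit(X, y)
print(model.predict(X.head()))
```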
Because they support the integration of multiple data sources and workloads, data lakehouses let R&D teams research and test innovative solutions to customer issues. Teams can focus on innovation rather than on the time-consuming data transformation and pre-processing needed to make data analytics-ready.
One main challenge of adopting a data lakehouse is its steep learning curve. An organization needs to learn and become familiar with new technologies and tools to transition fully to a data lakehouse, which may involve extra time, effort and cost to train staff on how to use and operate it.
Another challenge associated with data lakehouses is storing raw data: without proper security and cataloging mechanisms, that data can become unusable and cannot be fully trusted.
The data lakehouse emerged as a solution to the limitations of data warehouses and data lakes. It is a low-cost solution that can store data in various formats and support a range of data analytics workloads. It offers centralized, unified data storage that is flexible and efficient, along with strong data governance and security capabilities.
A data lakehouse architecture has five layers: data ingestion, storage, metadata, API and data consumption. As discussed in this article, data lakehouses can benefit organizations in different ways. However, there are also limitations, such as the steep learning curve, the complexity of migrating to the solution and the challenges of storing raw data. Weigh these trade-offs when considering a data lakehouse for your organization.