A data lake is a data repository for terabytes or petabytes of raw data stored in its original format.
The data can originate from a variety of data sources: IoT and sensor data, a simple file, or a binary large object (BLOB) such as a video, audio, image or multimedia file. Any manipulation of the data — to put it into a data pipeline and make it usable — is done when the data is extracted from the data lake.
With the rapid growth in the amounts of big data generated, ingested and used by organizations every day, data lakes provide the ability to store data as rapidly as it’s received. Data scientists who use data lakes rely on data management tools to make the data sets usable on demand, for initiatives around data discovery, extraction, business intelligence, cleansing and integration.
Data lakes are built using simple object storage methods to house many different formats and types of data. Organizations traditionally built data lakes on-premises — and some still do. However, many companies are also moving their data lakes to remote servers, using cloud storage solutions from major providers like AWS, Azure and GCP, among many others.
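As a rough sketch of what object storage looks like in practice, here's how a raw JSON event might be landed in an S3 bucket with Python's boto3 client. The bucket name and key layout are illustrative assumptions, not a prescribed convention:

```python
# Landing a raw IoT reading in object storage, exactly as received.
# "my-data-lake" and the key path are hypothetical placeholders.
import json
import boto3

s3 = boto3.client("s3")

raw_event = {"sensor_id": "t-1042", "temp_c": 21.7, "ts": "2024-01-15T09:30:00Z"}

# Object storage accepts the bytes as-is; no schema is enforced on write.
s3.put_object(
    Bucket="my-data-lake",
    Key="raw/iot/2024/01/15/t-1042.json",
    Body=json.dumps(raw_event).encode("utf-8"),
)
```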
Data stored in a data lake can be structured, semi-structured or unstructured. Even when data arrives in a structured form, the lake does not enforce a schema, and any metadata or other information appended to the data is not usable until it has been cataloged. Data in a data lake needs to be cleansed, tagged and structured before it can be applied in use cases. One way organizations do this is by following an extract, transform and load (ETL) process, standardizing data formats so they can extract valuable insights.
In this article, we'll discuss the components of a data lake, explain how data lakes are used, weigh their advantages and potential drawbacks, and consider the future of data lakes in enterprise data storage and management.
Data lakes contain a mix of structured, semi-structured and unstructured data, stored without being cleansed, tagged or manipulated.
A data warehouse contains only structured data. In most data warehouses or data centers, the data has been ingested through an extract, transform, load (ETL) process. It is then organized (staged), cleansed, transformed, cataloged and made available for use.
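To make that pipeline concrete, here's a minimal ETL sketch in Python using pandas. The file paths and column names are hypothetical stand-ins for whatever your own sources define:

```python
# Minimal ETL sketch: extract a raw CSV, cleanse and standardize it,
# then load a curated, query-friendly copy. Paths/columns are made up.
import pandas as pd

# Extract: pull a raw dump out of the lake.
df = pd.read_csv("raw/sales/2024-01.csv")

# Transform: cleanse and standardize formats.
df = df.dropna(subset=["order_id"])                   # drop incomplete rows
df["order_date"] = pd.to_datetime(df["order_date"])   # normalize dates
df["region"] = df["region"].str.strip().str.upper()   # standardize labels

# Load: write a columnar copy for the warehouse (needs pyarrow installed).
df.to_parquet("curated/sales/2024-01.parquet", index=False)
```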
A database (including a database management system) is used for storing, searching and reporting on data. Unlike data lakes, databases typically require a schema to be defined before data is written and are not designed to hold semi-structured or unstructured data. A data lake, on the other hand, can store raw data from all sources, and structure is applied only when the data is retrieved. The trade-off is that a data lake on its own doesn't offer the reporting capabilities you get with a database.
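To make the schema-on-read idea concrete, here's a small sketch using DuckDB, which infers structure from raw JSON files at query time rather than at write time. The paths and field names carry over from the hypothetical object-storage example above:

```python
# Schema-on-read: structure is applied when the raw files are queried,
# not when they were written into the lake.
import duckdb

result = duckdb.sql("""
    SELECT sensor_id, avg(temp_c) AS avg_temp
    FROM read_json_auto('raw/iot/2024/01/*/*.json')
    GROUP BY sensor_id
""").df()
print(result)
```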
A new option is also taking shape — the data lakehouse.
A data lakehouse is a modern data architecture, popular among many organizations, that incorporates the features of both data lakes and data warehouses. Much like a data lake, lakehouses store data in these formats:

- Structured data
- Semi-structured data
- Unstructured data

While also providing data warehouse tools such as:

- Schema definition and enforcement
- ACID transactions
- Support for SQL querying and business intelligence reporting
This combination of functionality means that data lakehouses are useful across all kinds of projects, from business intelligence reporting to data science and machine learning.
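As one concrete illustration, here's a sketch using the open-source deltalake Python package, which layers warehouse-like, transactional tables on top of plain files in a lake. The table path and columns are assumptions made for the example:

```python
# Lakehouse-style table on plain files (pip install deltalake pandas).
import pandas as pd
from deltalake import DeltaTable, write_deltalake

orders = pd.DataFrame({"order_id": [1, 2], "amount": [19.99, 5.00]})

# Writes are transactional: readers never see a half-written table.
write_deltalake("curated/orders", orders, mode="append")

# The table keeps a versioned history, a warehouse-like feature
# layered over ordinary files in the lake.
dt = DeltaTable("curated/orders")
print(dt.version(), dt.to_pandas().shape)
```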
The primary benefits of a data lake are speed, scalability and efficiency.
With the ever-growing volumes of data created, ingested and stored by a modern organization, there is significant value in a low-cost means of storing data quickly and letting any authorized person access it rapidly, on demand. By storing as much data as possible, organizations can take full advantage of machine learning and predictive analytics.
Data lakes can also help break down the data silos that have typically prevented organizations from realizing the value of their data. With visibility across that data, insights can inform strategic decisions.
As a practical example, historical sales and marketing data can be used to predict future performance, and as more data becomes available — along with more sophisticated machine learning and big data analytics tools — those predictions become more accurate.
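As a toy illustration of that idea, the sketch below fits a simple trend line to made-up monthly sales figures with scikit-learn and projects the next month. Real predictive analytics would use far richer features and models:

```python
# Fit a linear trend on twelve months of (fabricated) sales figures
# and project month 13.
import numpy as np
from sklearn.linear_model import LinearRegression

months = np.arange(1, 13).reshape(-1, 1)           # Jan..Dec of last year
sales = np.array([110, 115, 123, 130, 128, 140,
                  145, 150, 149, 160, 165, 172])   # units sold per month

model = LinearRegression().fit(months, sales)
projection = model.predict([[13]])
print(f"Projected sales for month 13: {projection[0]:.0f}")
```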
On its own, a data lake has few real disadvantages: it is simply an accumulation of data waiting to be used, and it is often paired with other data repositories.
That being said, data lakes require support, often from professionals with expertise in data science, to maintain them and make the data useful.
If you compare a data lake to a structured, relational database, the data lake may seem disorganized, although that isn’t necessarily a fair or accurate comparison.
When a data lake is not managed properly, it is sometimes referred to as a “data swamp.” In a data swamp, the quality of the data deteriorates, its usefulness and value to the organization diminish, latency increases and the lake becomes a liability for the company. At some point, a data swamp carries the same drawbacks, challenges and opportunity costs as dark data (stored or real-time data that a company possesses but cannot find, identify, optimize or use).
OK, so we understand that, in and of itself, a data lake is a collection of data stored in its native format on a server, either on-premises or in the cloud. Seems simple enough.
Still, it’s important to understand why you’re creating a data lake in your organization, and to approach the project with the same best practices you would apply to any major technology undertaking in a large organization.
With the growth of machine learning, data has become more available and usable, while data extraction from data lakes has become significantly faster and easier. Machine learning and data science can make dark data a thing of the past; the more data an organization has, the more information its data analytics systems have to learn from. Data is one of an organization’s most valuable assets. And data lakes give organizations the ability to capture, store and use those assets in the most efficient way.