Many business organizations begin their data analytics journey with great expectations of discovering hidden insights from data. The concept of unified storage — data lake technologies in the cloud — has gained momentum in recent years, especially with the expanding range of cost-effective cloud-based storage services.
Big data is readily available, with 2.5 quintillion (2.5 x 10^18) bytes generated every day! The challenge facing these organizations centers around the nature of this data. Big data comes in three forms — structured, unstructured, and semi-structured — and data must be preprocessed to specifications before it is ready for analytics consumption.
In this article, we’ll look at what these data structures mean for business analytics.
Structured data follows a fixed predefined format, usually in a quantitative and organized form. A great example is a database with customer names, addresses, phone numbers, email IDs, and billing information.
The pros of structured data are clear: this format can be consumed directly by an analytics tool and may not require any additional reformatting. However, structured data can only be used for its intended purpose, with tools that expect its particular schema.
Semi-structured data is not “in-between” structured and unstructured data. Instead, it is a form of structured data that does not conform to the strict schema of relational databases.
Data entities that belong to the same class are instead described by metadata tags or other semantic markers that give some structure to the data assets, clearly differentiating them from unstructured data. Examples include tab-delimited files, XML and JSON documents, and data from email systems.
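This idea can be illustrated with a small JSON document: the field names act as the metadata tags, so records of the same class remain queryable even when they do not share an identical shape. The records and field names below are invented for illustration.

```python
import json

# Semi-structured: each record carries its own field tags (keys),
# but records are not required to share an identical schema.
raw = """
[
  {"name": "Ada",   "email": "ada@example.com"},
  {"name": "Grace", "email": "grace@example.com", "phone": "555-0100"}
]
"""

records = json.loads(raw)

# The semantic markers let us pull a field from every record,
# even though only one record has a "phone" field.
emails = [r["email"] for r in records]
print(emails)
```

Note that no database schema was declared up front; the structure lives in the data itself.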
Unstructured data is usually qualitative data that needs preprocessing before it can be made available to analytics tools for consumption. Examples include raw IoT data, network logs, audio and video files, and social media posts.
In its native format, unstructured data can be stored in a unified storage repository, a data lake. It accumulates and scales rapidly — most real-time data streams are generated in unstructured format. To consume unstructured data, you have to use specialized tools and rely on expertise to give it the required structure schema.
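As a minimal sketch of that preprocessing step, a raw log line (an unstructured string) can be given a structure schema with a regular expression. The log format and field names here are assumptions for illustration only.

```python
import re

# A raw, machine-generated log line with no declared schema.
line = "2024-05-01 12:30:45 ERROR disk quota exceeded on /dev/sda1"

# Impose structure at read time: timestamp, severity, message.
pattern = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (.+)$")

match = pattern.match(line)
event = {
    "timestamp": match.group(1),
    "severity": match.group(2),
    "message": match.group(3),
}
print(event)
```

Real pipelines use dedicated parsers for each data source, but the principle is the same: structure is applied to the raw data, not stored with it.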
(Learn about normalizing data.)
Let’s explore what this means for your data analytics journey:
Structured data, as defined above, follows a fixed, predefined format. It typically comes from relational databases, enterprise systems, and other organized data sources.
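A customer table in a relational database is the canonical case: every row conforms to the same predefined schema, so analytics tools can consume it directly. This sketch uses Python's built-in sqlite3 module with invented column names and records.

```python
import sqlite3

# A structured customer table: the schema is fixed before any data arrives.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        name  TEXT NOT NULL,
        email TEXT NOT NULL,
        phone TEXT
    )
""")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Ada", "ada@example.com", "555-0100"),
     ("Grace", "grace@example.com", None)],
)

# Because every row has the same shape, queries need no per-record handling.
rows = conn.execute(
    "SELECT name, email FROM customers ORDER BY name"
).fetchall()
print(rows)
```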
Pros

- Can be consumed directly by an analytics tool, often with no additional reformatting required
- Quantitative and organized, making it straightforward to store and query

Cons

- Usable only for its intended purpose, with tools that expect its schema formatting
Unstructured data is qualitative data that needs preprocessing before analytics tools can consume it. Examples include raw IoT and sensor data, network and machine logs, audio and video files, and social media posts.
Pros

- Can be stored in its native format in a unified repository such as a data lake
- Captures most real-time data streams, accumulating and scaling rapidly

Cons

- Requires preprocessing before analytics tools can consume it
- Demands specialized tools and expertise to impose the required structure schema
Semi-structured data, as described above, does not conform to the strict schema of relational databases; instead, metadata tags or other semantic markers describe data entities that belong to the same class. Common examples are tab-delimited files, XML and JSON documents, and data from email systems.
Pros

- Metadata tags and semantic markers provide enough structure to query and classify the data
- More flexible than the strict schema of a relational database

Cons

- Does not conform to a database schema, so some tools may still require transformation before use
If your data pipeline is built with a data lake, you can take advantage of the flat storage architecture to source data in all formats. A pre-built schema is not required: the data can later be queried by giving it structure as needed (schema-on-read) or by relying on the order in which it was acquired. Metadata tags are commonly used during the querying process, which means that a solid metadata management strategy must be in place.
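The schema-on-read pattern can be sketched with a toy "lake" of raw JSON lines: records of mixed shape sit side by side, each carrying a metadata tag, and shape is imposed only when a query runs. The source names and fields below are invented for illustration.

```python
import json

# A toy data lake: heterogeneous records stored raw, each tagged
# with metadata describing where it came from.
lake = [
    '{"source": "crm",    "name": "Ada", "email": "ada@example.com"}',
    '{"source": "sensor", "device": "t-01", "temp_c": 21.5}',
]

def query(lake, source):
    # Schema-on-read: parse records only at query time, and use the
    # metadata tag to select the relevant subset.
    records = (json.loads(line) for line in lake)
    return [r for r in records if r["source"] == source]

crm_records = query(lake, "crm")
print(crm_records[0]["email"])
```

Because no schema was enforced at write time, the sensor record could land in the same store as the CRM record without any up-front modeling.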
The process of extracting, loading, and transforming data (ELT) should be automated and simplified to meet the scalability needs of the data platform. Since the preprocessing step only takes place when an analytics application queries the data, the data lake can handle both write-heavy and read-heavy workloads. This means that the data platform can be flexible, scalable, and cost-effective, given the availability of low-cost cloud storage options.
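The ELT flow described above can be sketched in a few lines: extract and load raw records unchanged, and defer the transform step until a query runs. The function names, record shapes, and fields are assumptions for illustration, not a real pipeline API.

```python
# A minimal ELT sketch: raw data is loaded as-is; the transform
# happens per-record, only at query time (schema-on-read).

def extract(source):
    # Extract: pull raw records from a source system unchanged.
    return list(source)

def load(lake, records):
    # Load: append raw records to the lake with no preprocessing.
    lake.extend(records)

def transform(record):
    # Transform: shape a record for one analytics use case.
    return {"user": record["name"].lower(), "spend": float(record["spend"])}

lake = []
load(lake, extract([{"name": "Ada", "spend": "42.50"}]))

# The transform runs only when the report is queried.
report = [transform(r) for r in lake]
print(report)
```

Contrast this with schema-on-write (ETL), where the transform would run before anything reached storage.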
This pipeline workflow incentivizes organizations to leverage data of all structures and formats while avoiding the resource-intensive schema-on-write process for real-time unstructured data streams that can quickly grow in volume.
With all that we've covered, you may be wondering: why not just focus on structured data that complies with the required tooling specifications? Or use a traditional data warehouse system that employs a schema-on-write method to preprocess all data before storage?
There are a few things to consider.
Data lake technology is built on the idea of turning away no data: it accelerates the analytics process by loading all data from source systems directly, at the leaf level.
This approach gives analytics teams the freedom to access a growing pool of real-time data streams, processing only the portion of data required by the tooling. (In most cases, that portion is well under 10%.)
Unlike the rigid schema-based model of a data warehouse system, a data lake allows for scalable analytics operations such as:

- Querying raw data with schema-on-read, shaping it per use case
- Sourcing data in all formats without a pre-built schema
- Integrating multiple third-party analytics tools, each with its own schema requirements

This flexibility is crucial for modern analytics environments where data types and data sources are continually evolving.
Structured and unstructured data assets scale differently, and there may be no consistent approach to modeling heterogeneous data assets with a single schema framework.
Data lakes offer a more cost-effective and efficient solution by storing raw data in its native format, thus reducing the need for extensive preprocessing and transformation.
An effective data management strategy focuses on the security, auditability, and transparency of structured, unstructured, and semi-structured data assets.
Govern and classify the data to securely manage access between relevant data consumers and data producers. This enables self-service functionality and offers the flexibility to integrate multiple third-party analytics tools, each with its own schema and structure requirements.
It's clear that while structured data offers ease of use and consistency, the flexibility, scalability, and cost-effectiveness of data lakes make them a superior choice for handling diverse data types. This approach allows organizations to leverage the strengths of all data structures, ensuring comprehensive and effective data analytics practices.
See an error or have a suggestion? Please let us know by emailing ssg-blogs@splunk.com.
This posting does not necessarily represent Splunk's position, strategies or opinion.
The Splunk platform removes the barriers between data and action, empowering observability, IT and security teams to ensure their organizations are secure, resilient and innovative.
Founded in 2003, Splunk is a global company with over 7,500 employees, more than 1,020 patents to date, and availability in 21 regions around the world. It offers an open, extensible data platform that supports shared data across any environment so that all teams in an organization can get end-to-end visibility, with context, for every interaction and business process. Build a strong data foundation with Splunk.