Like physical architecture, the architecture that runs your business data, and any compute-intensive AI projects, matters. This data architecture governs a critical part of your business: how well users can translate raw information into real knowledge and actionable insights.
Today, your data architecture is getting perhaps more attention than ever before, thanks largely to the genuinely usable AI tools that now exist.
Scalable AI workloads are notoriously compute-intensive: you need massively parallel compute and storage capabilities to train large AI models continually as new data streams are ingested into your data platform. Indeed, it is the data architecture that determines how data is stored, processed and analyzed. The data architecture is also responsible for integrating external compute services to run large AI models.
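To make that continual-training idea concrete, here’s a minimal sketch, assuming scikit-learn is available; the numpy-generated batches are a stand-in for micro-batches arriving from a real data platform:

```python
# Minimal sketch: incrementally training a model as new data streams in.
# Assumes scikit-learn; the "stream" here is simulated with numpy.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(42)
model = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])  # partial_fit needs the full label set up front

for batch in range(10):  # stand-in for micro-batches from an ingestion layer
    X = rng.normal(size=(256, 8))             # 256 events, 8 features each
    y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy labels for illustration
    model.partial_fit(X, y, classes=classes)  # update weights on the new batch only

print("trained on 10 streaming batches; coef shape:", model.coef_.shape)
```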
So, what exactly is a data architecture? Let’s take a look.
So let’s define it: data architecture is the design and organization of the systems, processes, models and guidelines that describe how end-to-end data pipelines are implemented. (The data pipeline covers every process from data ingestion and transformation to distribution, processing, consumption and storage.)
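As a rough illustration of those pipeline stages, here’s a minimal sketch in plain Python; the function names (ingest, transform, store) are illustrative, not any product’s API:

```python
# Minimal sketch of the pipeline stages named above, in plain Python.
def ingest(source):
    """Pull raw records from a source system."""
    yield from source

def transform(records):
    """Normalize each raw record into a common shape."""
    for r in records:
        yield {"user": r["user"].lower(), "amount": float(r["amount"])}

def store(records, sink):
    """Persist transformed records for downstream consumption."""
    sink.extend(records)

raw = [{"user": "Ada", "amount": "3.50"}, {"user": "LIN", "amount": "7"}]
warehouse = []
store(transform(ingest(raw)), warehouse)
print(warehouse)  # consumption: analytics and AI tools read from here
```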
The design of a data architecture is instrumental to your data management strategy. Not every data architecture must be equally robust, but let’s look at data architecture through the lens of AI, since that makes the stakes clear.
Especially considering the prevalence of large language models (LLMs), which involve billions of model parameters trained on large data volumes, the data architecture must meet demanding requirements: massively parallel compute and storage, continuous ingestion of streaming data, and seamless integration with external compute services.
What makes a data architecture? There are three levels to consider:
First is the conceptual level: a semantic model of high-level components that identifies the business entities, data assets and systems at play.
The conceptual design describes relationships and dependencies between these entities and assets, including data, apps and systems involved in the data pipeline.
Next, the logical level includes the data model, platform and schema for data management. Here, you’ll explicitly define entities and relationships, but keep them independent of any technology platform or software stack.
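For instance, a logical model might be sketched with plain dataclasses, deliberately free of any database engine or vendor; the entity names here are illustrative:

```python
# Minimal sketch of a logical model: entities and relationships are explicit,
# but nothing here commits to a specific database engine or vendor.
from dataclasses import dataclass

@dataclass
class Customer:
    customer_id: int
    name: str

@dataclass
class Order:
    order_id: int
    customer_id: int  # relationship: each Order references one Customer
    total: float

# At the physical level, this same model could map to warehouse tables,
# Parquet files in a lake, or documents in an object store.
```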
Finally, the physical level covers the actual design and implementation of the components, workflows and processes between the entities defined at the conceptual and logical levels. The physical design may involve any combination of the storage and processing technologies discussed below, including data warehouses, data lakes and data lakehouses.
(Learn about IT monitoring across all of your environments.)
When designing or implementing your data architecture, a crucial item to determine in advance is what sort of data storage technology is right for the data project at hand.
At the platform and infrastructure layers, your data architecture may employ a data warehouse, a data lake or a data lakehouse design. This decision is important, so let’s look at the key items to consider. Keep in mind the type of data project you’re working on here: basic business data needs, an AI use case, or something in the middle.
A data lake is a low-cost storage solution that stores data in its raw, unstructured format. It follows a schema-on-read approach that lets users ingest data in real time, which is very important, while preprocessing only the portion of data each analytics or AI tool requires, just prior to consumption.
As a result, the data platform efficiently ingests real-time data streams and integrates rapidly with diverse third-party AI tools, without locking into specific tooling specifications and standards. That flexibility suits the fast-moving environments modern organizations operate in.
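Here’s a minimal sketch of schema-on-read in plain Python; the file name and fields are illustrative. Raw events land untouched, and a consumer applies its own schema only at read time:

```python
# Minimal sketch of schema-on-read: raw events are stored exactly as they
# arrive; structure is imposed only when a consumer reads them.
import json

# Ingest: write events as-is, with no upfront modeling.
raw_events = ['{"ts": "2024-01-01T00:00:00", "temp": "21.5", "site": "A"}',
              '{"ts": "2024-01-01T00:01:00", "temp": "22.1"}']
with open("lake_events.jsonl", "w") as f:
    f.write("\n".join(raw_events))

# Read: each consumer projects only the fields it needs, with its own types.
def read_with_schema(path):
    with open(path) as f:
        for line in f:
            e = json.loads(line)
            yield {"ts": e["ts"],
                   "temp": float(e["temp"]),
                   "site": e.get("site", "unknown")}  # tolerate missing fields

print(list(read_with_schema("lake_events.jsonl")))
```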
However, user beware: the data lake can quickly turn into a “data swamp,” where too much information is available with too little utility to the end user.
(Know all the differences: data lakes vs. data warehouses.)
On the other hand, a data warehouse follows a schema-on-write approach: all ingested data is preprocessed and given a predefined structure as it is stored, which means more upfront work.
This standardized framework is more performant and efficient for batch data processing, as long as your AI projects and tools don’t deviate from the standardized specifications. However, modern AI use cases rely heavily on real-time data streams, and schema-on-write preprocessing slows down the pipeline. Data warehouse systems can also introduce silos as they bend to comply with diverse tooling specifications.
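For contrast, here’s a minimal schema-on-write sketch using the standard-library sqlite3 module as a stand-in for a warehouse; the table and fields are illustrative. The structure is enforced as data is stored:

```python
# Minimal sketch of schema-on-write: records are cast, defaulted and
# validated up front, and the store rejects anything that doesn't conform.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE readings (
    ts   TEXT NOT NULL,
    temp REAL NOT NULL,
    site TEXT NOT NULL
)""")

def write_reading(event):
    # Preprocessing happens before storage: this is the upfront work.
    row = (event["ts"], float(event["temp"]), event.get("site", "unknown"))
    conn.execute("INSERT INTO readings VALUES (?, ?, ?)", row)

write_reading({"ts": "2024-01-01T00:00:00", "temp": "21.5", "site": "A"})
print(conn.execute("SELECT * FROM readings").fetchall())
```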
An alternative is the data lakehouse, an emerging storage design that combines characteristics of data lakes and data warehouses. How a lakehouse is implemented depends on your overall data architecture design and preferences.
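One common lakehouse ingredient can be sketched with the pyarrow package (an assumption; real lakehouses typically layer table formats such as Delta Lake or Apache Iceberg on top): open columnar files on cheap storage, like a lake, with an enforced schema, like a warehouse:

```python
# Minimal sketch of one lakehouse ingredient: open columnar Parquet files
# on low-cost storage, written with an explicitly enforced schema.
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("ts", pa.string()),
                    ("temp", pa.float64()),
                    ("site", pa.string())])

table = pa.table({"ts": ["2024-01-01T00:00:00"],
                  "temp": [21.5],
                  "site": ["A"]}, schema=schema)

pq.write_table(table, "lakehouse_readings.parquet")  # lands on disk/object store
print(pq.read_table("lakehouse_readings.parquet").schema)
```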
So, your data storage options, discussed above, serve the platform and infrastructure levels of your data project. But you’re not done yet.
At a higher level of abstraction, you’ll choose a data management design that handles the complexity of your data workloads in hybrid, multi-cloud environments and scales efficiently.
Two modern design principles are the data mesh and data fabric.
Data mesh takes a domain-oriented and decentralized approach where individual teams build their own data pipeline products end-to-end.
The process is federated, but not siloed. Teams have the autonomy to operate their own data environments, and they can use data lake platform technologies to maintain a common, unified storage system in which each use case preprocesses and consumes raw data according to its own specifications.
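A minimal sketch of that idea, with illustrative names: each domain team publishes a data product behind a small contract, and consumers read through the product rather than the team’s raw storage:

```python
# Minimal sketch of the data mesh idea: each domain team owns and serves
# its own data product behind a published contract.
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    domain: str    # owning team, e.g. "payments" or "shipping"
    name: str
    schema: dict   # the published contract consumers rely on
    records: list = field(default_factory=list)

    def serve(self):
        """Consumers read through the product, never the team's raw storage."""
        return list(self.records)

payments = DataProduct("payments", "settled_transactions",
                       {"txn_id": "int", "amount": "float"})
payments.records.append({"txn_id": 1, "amount": 9.99})
print(payments.serve())
```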
Another approach is the data fabric design principle, which builds a unified, holistic and integrated data environment.
The data storage and processing layers are integrated seamlessly, with continuous analytics running across several data domains. These data sources and data pipeline processes are reusable, and they work across on-premises, hybrid cloud and multi-cloud environments.
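A minimal sketch of the fabric’s unifying layer, with illustrative connector names and locations: a catalog resolves logical dataset names to whichever environment actually holds the data, so consumers never hard-code a location:

```python
# Minimal sketch of a data fabric's unifying layer: a catalog maps logical
# dataset names to their actual environment, hiding where the data lives.
CATALOG = {
    "sales.orders":  {"env": "on-prem",  "uri": "postgres://dc1/orders"},
    "web.clicks":    {"env": "cloud-a",  "uri": "s3://bucket/clicks/"},
    "iot.telemetry": {"env": "cloud-b",  "uri": "abfs://container/telemetry/"},
}

def resolve(dataset: str) -> str:
    """Consumers ask for a dataset by name; the fabric routes the request."""
    entry = CATALOG[dataset]
    print(f"routing {dataset} -> {entry['env']}")
    return entry["uri"]

uri = resolve("web.clicks")  # same call regardless of environment
```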
Data architecture choices, such as data lake vs. data warehouse, data fabric vs. data mesh, and your data movement and management strategies, determine the flexibility, efficiency, scalability and security of your end-to-end data pipelines and AI use cases.