Like physical architecture, the architecture that runs your business data, and any compute-intensive AI projects, matters. This data architecture governs a critical part of your business: how well users can translate raw information into real knowledge and actionable insights.
Today, your data architecture is getting perhaps more attention than ever before, thanks largely to the genuinely usable AI tools that now exist.
Scalable AI workloads are notoriously compute-intensive: you need massively parallel compute and storage capabilities to train large AI models continually as new data streams are ingested into your data platform. Indeed, it is the data architecture that determines how data is stored, processed and analyzed. The data architecture is also responsible for integrating external compute services to run large AI models.
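To make that continual-training idea concrete, here’s a minimal sketch, assuming scikit-learn is available; the numpy-generated batches are a stand-in for micro-batches arriving from a real data platform:

```python
# Minimal sketch: incrementally training a model as new data streams in.
# Assumes scikit-learn; the "stream" here is simulated with numpy.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(42)
model = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])  # partial_fit needs the full label set up front

for batch in range(10):  # stand-in for micro-batches from an ingestion layer
    X = rng.normal(size=(256, 8))             # 256 events, 8 features each
    y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy labels for illustration
    model.partial_fit(X, y, classes=classes)  # update weights on the new batch only

print("trained on 10 streaming batches; coef shape:", model.coef_.shape)
```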
So, what exactly is a data architecture? Let’s take a look.
So let’s define it: data architecture is the design and organization of the systems, processes, models and guidelines that describe how end-to-end data pipelines are implemented. (The data pipeline covers every process from data ingestion and transformation to distribution, processing, consumption and storage.)
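As a rough illustration of those pipeline stages, here’s a minimal sketch in plain Python; the function names (ingest, transform, store) are illustrative, not any product’s API:

```python
# Minimal sketch of the pipeline stages named above, in plain Python.
def ingest(source):
    """Pull raw records from a source system."""
    yield from source

def transform(records):
    """Normalize each raw record into a common shape."""
    for r in records:
        yield {"user": r["user"].lower(), "amount": float(r["amount"])}

def store(records, sink):
    """Persist transformed records for downstream consumption."""
    sink.extend(records)

raw = [{"user": "Ada", "amount": "3.50"}, {"user": "LIN", "amount": "7"}]
warehouse = []
store(transform(ingest(raw)), warehouse)
print(warehouse)  # consumption: analytics and AI tools read from here
```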
The design of a data architecture is instrumental to your data management strategy. Not every data architecture must be equally robust, but let’s look at data architecture through the lens of AI, since that makes the stakes clear.
Especially considering the prevalence of large language models (LLMs), which involve billions of model parameters trained on large data volumes, the data architecture must meet demanding requirements: massively parallel compute and storage, continuous ingestion of streaming data, and seamless integration with external compute services.
What makes a data architecture? There are three levels to consider:
First is the conceptual level: a semantic model of high-level components that identifies the business entities, data assets and systems at play.
The conceptual design describes relationships and dependencies between these entities and assets, including data, apps and systems involved in the data pipeline.
Next, the logical level includes the data model, platform and schema for data management. Here, you’ll explicitly define entities and relationships, but keep them independent of any technology platform or software stack.
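For instance, a logical model might be sketched with plain dataclasses, deliberately free of any database engine or vendor; the entity names here are illustrative:

```python
# Minimal sketch of a logical model: entities and relationships are explicit,
# but nothing here commits to a specific database engine or vendor.
from dataclasses import dataclass

@dataclass
class Customer:
    customer_id: int
    name: str

@dataclass
class Order:
    order_id: int
    customer_id: int  # relationship: each Order references one Customer
    total: float

# At the physical level, this same model could map to warehouse tables,
# Parquet files in a lake, or documents in an object store.
```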
Finally, the physical level covers the actual design and implementation of the components, workflows and processes between the entities defined at the conceptual and logical levels. The physical design may involve any combination of the storage and processing technologies discussed below, including data warehouses, data lakes and data lakehouses.
(Learn about IT monitoring across all of your environments.)
When designing or implementing your data architecture, a crucial item to determine in advance is what sort of data storage technology is right for the data project at hand.
At the platform and infrastructure layers, your data architecture may employ a data warehouse, a data lake or a data lakehouse design. This decision is important, so let’s look at the key items to consider. Keep in mind the type of data project you’re working on here: basic business data needs, an AI use case, or something in the middle.
A data lake is a low-cost storage solution that stores data in its raw, unstructured format. It follows a schema-on-read approach that lets users ingest data in real time, which is very important, while preprocessing only the portion of data each analytics or AI tool requires, just prior to consumption.
As a result, the data platform efficiently ingests real-time data streams and integrates rapidly with diverse third-party AI tools, without locking into specific tooling specifications and standards. That flexibility suits the fast-moving environments modern organizations operate in.
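Here’s a minimal sketch of schema-on-read in plain Python; the file name and fields are illustrative. Raw events land untouched, and a consumer applies its own schema only at read time:

```python
# Minimal sketch of schema-on-read: raw events are stored exactly as they
# arrive; structure is imposed only when a consumer reads them.
import json

# Ingest: write events as-is, with no upfront modeling.
raw_events = ['{"ts": "2024-01-01T00:00:00", "temp": "21.5", "site": "A"}',
              '{"ts": "2024-01-01T00:01:00", "temp": "22.1"}']
with open("lake_events.jsonl", "w") as f:
    f.write("\n".join(raw_events))

# Read: each consumer projects only the fields it needs, with its own types.
def read_with_schema(path):
    with open(path) as f:
        for line in f:
            e = json.loads(line)
            yield {"ts": e["ts"],
                   "temp": float(e["temp"]),
                   "site": e.get("site", "unknown")}  # tolerate missing fields

print(list(read_with_schema("lake_events.jsonl")))
```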
However, user beware: the data lake can quickly turn into a “data swamp,” where too much information is available with too little utility to the end user.
(Know all the differences: data lakes vs. data warehouses.)
On the other hand, a data warehouse follows a schema-on-write approach: all ingested data is preprocessed and given a predefined structure as it is stored, which means more upfront work.
This standardized framework is more performant and efficient for batch data processing, as long as your AI projects and tools don’t deviate from the standardized specifications. However, modern AI use cases rely heavily on real-time data streams, and schema-on-write preprocessing slows down the pipeline. Data warehouse systems can also introduce silos as they bend to comply with diverse tooling specifications.
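For contrast, here’s a minimal schema-on-write sketch using the standard-library sqlite3 module as a stand-in for a warehouse; the table and fields are illustrative. The structure is enforced as data is stored:

```python
# Minimal sketch of schema-on-write: records are cast, defaulted and
# validated up front, and the store rejects anything that doesn't conform.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE readings (
    ts   TEXT NOT NULL,
    temp REAL NOT NULL,
    site TEXT NOT NULL
)""")

def write_reading(event):
    # Preprocessing happens before storage: this is the upfront work.
    row = (event["ts"], float(event["temp"]), event.get("site", "unknown"))
    conn.execute("INSERT INTO readings VALUES (?, ?, ?)", row)

write_reading({"ts": "2024-01-01T00:00:00", "temp": "21.5", "site": "A"})
print(conn.execute("SELECT * FROM readings").fetchall())
```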
An alternative is the data lakehouse, an emerging storage design that combines characteristics of data lakes and data warehouses. How a lakehouse is implemented depends on your overall data architecture design and preferences.
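One common lakehouse ingredient can be sketched with the pyarrow package (an assumption; real lakehouses typically layer table formats such as Delta Lake or Apache Iceberg on top): open columnar files on cheap storage, like a lake, with an enforced schema, like a warehouse:

```python
# Minimal sketch of one lakehouse ingredient: open columnar Parquet files
# on low-cost storage, written with an explicitly enforced schema.
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("ts", pa.string()),
                    ("temp", pa.float64()),
                    ("site", pa.string())])

table = pa.table({"ts": ["2024-01-01T00:00:00"],
                  "temp": [21.5],
                  "site": ["A"]}, schema=schema)

pq.write_table(table, "lakehouse_readings.parquet")  # lands on disk/object store
print(pq.read_table("lakehouse_readings.parquet").schema)
```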
So, your data storage options, discussed above, serve the platform and infrastructure levels of your data project. But you’re not done yet.
At a higher level of abstraction, you’ll choose a data management design that handles the complexity of your data workloads in hybrid, multi-cloud environments and scales efficiently.
Two modern design principles are the data mesh and data fabric.
Data mesh takes a domain-oriented and decentralized approach where individual teams build their own data pipeline products end-to-end.
The process is federated, but not siloed. Teams have the autonomy to operate their own data environments, and they can use data lake platform technologies to maintain a common, unified storage system in which each use case preprocesses and consumes raw data according to its own specifications.
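A minimal sketch of that idea, with illustrative names: each domain team publishes a data product behind a small contract, and consumers read through the product rather than the team’s raw storage:

```python
# Minimal sketch of the data mesh idea: each domain team owns and serves
# its own data product behind a published contract.
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    domain: str    # owning team, e.g. "payments" or "shipping"
    name: str
    schema: dict   # the published contract consumers rely on
    records: list = field(default_factory=list)

    def serve(self):
        """Consumers read through the product, never the team's raw storage."""
        return list(self.records)

payments = DataProduct("payments", "settled_transactions",
                       {"txn_id": "int", "amount": "float"})
payments.records.append({"txn_id": 1, "amount": 9.99})
print(payments.serve())
```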
Another approach is the data fabric design principle, which builds a unified, holistic and integrated data environment.
The data storage and processing layers are integrated seamlessly, with continuous analytics running across several data domains. These data sources and data pipeline processes are reusable, and they work across on-premises, hybrid cloud and multi-cloud environments.
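A minimal sketch of the fabric’s unifying layer, with illustrative connector names and locations: a catalog resolves logical dataset names to whichever environment actually holds the data, so consumers never hard-code a location:

```python
# Minimal sketch of a data fabric's unifying layer: a catalog maps logical
# dataset names to their actual environment, hiding where the data lives.
CATALOG = {
    "sales.orders":  {"env": "on-prem",  "uri": "postgres://dc1/orders"},
    "web.clicks":    {"env": "cloud-a",  "uri": "s3://bucket/clicks/"},
    "iot.telemetry": {"env": "cloud-b",  "uri": "abfs://container/telemetry/"},
}

def resolve(dataset: str) -> str:
    """Consumers ask for a dataset by name; the fabric routes the request."""
    entry = CATALOG[dataset]
    print(f"routing {dataset} -> {entry['env']}")
    return entry["uri"]

uri = resolve("web.clicks")  # same call regardless of environment
```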
Data architecture choices, such as data lake vs. data warehouse, data fabric vs. data mesh, and your data movement and management strategies, determine the flexibility, efficiency, scalability and security of your end-to-end data pipelines and AI use cases.