AI infrastructure refers to the technology stack that runs AI workloads. Any AI technology stack consists of:
AI technologies are highly resource-intensive and typically rely on bespoke infrastructure, as organizations aim to maximize the compute efficiency, reliability and scalability of that infrastructure.
(Related reading: infrastructure security & Splunk Infrastructure Monitoring.)
Let’s review the key components of an artificial intelligence infrastructure. (New to IT? Start with this IT infrastructure beginner’s guide.)
The most interesting AI infrastructure component for AI developers is the specialized hardware technology that is used to train and run AI models. A GPU architecture contains:
HPC CPUs are more commonly used for standardized computing tasks that may be latency-sensitive, such as:
(CPUs vs. GPUs: when to use each.)
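To make that division of labor concrete, here is a minimal sketch (assuming PyTorch is installed) that places a parallel workload on a GPU when one is available and falls back to the CPU otherwise:

```python
import torch

# Prefer a GPU when one is available; fall back to the CPU otherwise.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A large matrix multiplication is the kind of parallel workload GPUs excel at.
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)
c = a @ b  # runs on the GPU if one was found, on the CPU otherwise

print(f"Ran on: {device}, result shape: {tuple(c.shape)}")
```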
AI model performance is highly dependent on the data used to train it. In fact, the success of LLMs such as ChatGPT largely comes down to their training data.
While data may be free and publicly available, it takes an efficient storage infrastructure and data platform to ingest, process, analyze and train AI models on large volumes of information at scale. The storage infrastructure consists of:
Key considerations for AI storage infrastructure include scalability (particularly with regard to storage cost), I/O performance, security and compliance.
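As an illustration of why storage layout matters, the sketch below streams training text shard by shard instead of loading an entire corpus into memory. The shard file names are hypothetical, and the example assumes PyTorch's data utilities are available:

```python
from torch.utils.data import IterableDataset, DataLoader

class ShardedTextDataset(IterableDataset):
    """Streams training examples shard by shard rather than loading
    the full corpus into memory at once (shard paths are hypothetical)."""

    def __init__(self, shard_paths):
        self.shard_paths = shard_paths

    def __iter__(self):
        for path in self.shard_paths:
            with open(path, "r", encoding="utf-8") as f:
                for line in f:
                    yield line.strip()

# Hypothetical shard files; in practice these might live in object storage.
dataset = ShardedTextDataset(["shard-000.txt", "shard-001.txt"])
loader = DataLoader(dataset, batch_size=32)

for batch in loader:
    pass  # tokenize and feed the batch to the model here
```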
AI workloads require high-performance network fabrics that can handle trillions of AI model executions and compute processes across distributed hardware clusters. The network must be able to load-balance elephant flows (very large, long-lived data transfers), especially when the network architecture follows hierarchical patterns for efficient data handling.
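To illustrate what that fabric actually carries, the sketch below shows distributed training processes synchronizing a tensor across the cluster with an all-reduce. It assumes PyTorch's torch.distributed with the NCCL backend, and a launcher such as torchrun setting the rank environment variables:

```python
import torch
import torch.distributed as dist

def main():
    # Rank, world size and master address are typically injected by the
    # launcher (e.g. torchrun) via environment variables.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # Each process holds a local tensor (a stand-in for a gradient shard).
    local = torch.ones(1024, device="cuda") * rank

    # all_reduce sums the tensors from every process over the network fabric.
    dist.all_reduce(local, op=dist.ReduceOp.SUM)

    if rank == 0:
        print("Reduced first element:", local[0].item())

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```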
The performance impact at the physical layer should be minimal: high I/O in real-time data stream processing can lead to packet loss. The network should:
The platform and software/application stack provides resources specific to AI development and model deployment.
ML frameworks such as PyTorch, GPU programming toolkits such as CUDA, and other model-specific frameworks speed up the AI development process. These software tools are typically provisioned as containerized systems that isolate AI development from the underlying hardware infrastructure.
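As a rough illustration of how these frameworks and GPU toolkits work together, the sketch below runs a single mixed-precision training step with PyTorch on CUDA hardware. The model and batch are stand-ins for illustration, not a recommended configuration:

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(1024, 10).to(device)          # stand-in for a real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=device.type == "cuda")
loss_fn = nn.CrossEntropyLoss()

# One training step with automatic mixed precision on CUDA hardware.
x = torch.randn(64, 1024, device=device)
y = torch.randint(0, 10, (64,), device=device)

optimizer.zero_grad()
with torch.cuda.amp.autocast(enabled=device.type == "cuda"):
    loss = loss_fn(model(x), y)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```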
Finally, MLOps is adopted to automate the management of:
Monitoring and optimization run at the infrastructure layer, using AI-driven monitoring and analytics tools that analyze traffic from a distributed AI infrastructure, including cloud-based and on-premises systems.
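One common source of such telemetry is the GPU itself. The sketch below, which assumes NVIDIA's management library is available through the pynvml bindings, samples per-device utilization and memory usage that a monitoring pipeline could ingest:

```python
from pynvml import (
    nvmlInit,
    nvmlShutdown,
    nvmlDeviceGetCount,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetUtilizationRates,
    nvmlDeviceGetMemoryInfo,
)

nvmlInit()
try:
    for i in range(nvmlDeviceGetCount()):
        handle = nvmlDeviceGetHandleByIndex(i)
        util = nvmlDeviceGetUtilizationRates(handle)   # GPU / memory busy %
        mem = nvmlDeviceGetMemoryInfo(handle)          # bytes used / total
        print(f"gpu{i} util={util.gpu}% mem={mem.used / mem.total:.0%}")
finally:
    nvmlShutdown()
```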
(Understand the layers: read about the OSI networking model.)
AI models are deployed in production environments either:
The infrastructure running these services is not part of the AI data and processing pipeline but is integrated via API calls to deliver a secondary downstream service.
For example, Meta uses its Llama 3 GPU clusters primarily for generative AI use cases. And as it expands its GPU cluster portfolio, secondary services — such as ads, search, recommender systems, and ranking algorithms — can take advantage of its genAI models.
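In practice, that integration is often just an HTTP call from the downstream service to a model-serving endpoint. A minimal sketch, with a hypothetical URL and payload:

```python
import requests

# Hypothetical inference endpoint exposed by the model-serving layer.
INFERENCE_URL = "https://models.example.com/v1/generate"

payload = {"prompt": "Summarize today's incident reports.", "max_tokens": 128}
response = requests.post(INFERENCE_URL, json=payload, timeout=30)
response.raise_for_status()

print(response.json())  # downstream services consume the generated output
```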
All of this requires an expansive data lake platform that can:
(Learn how Splunk AI accelerates detection, investigation and response.)
Now, let’s look at a specific example of an AI infrastructure.
Meta recently published details of the generative AI infrastructure it uses to train and run its latest LLMs, including Llama 3. The infrastructure includes two GPU clusters, each containing 24,576 flagship NVIDIA H100 GPUs. This is an upgrade from its previous AI research infrastructure, which contained 16,000 NVIDIA A100 GPUs.
The company plans to further extend its computing capacity by deploying 350,000 H100 GPUs by the end of 2024.
These clusters run on two different network fabric systems: one built with RDMA over Converged Ethernet (RoCE) and the other with NVIDIA Quantum2 InfiniBand.
Both fabrics offer 400Gbps endpoints. Meta uses its own Grand Teton GPU hardware platform, open sourced as part of its Open Compute Project (OCP) contributions. The platform is built on the Open Rack v3 (ORV3) power and rack infrastructure design, which has been widely adopted across the industry. The ORV3 ecosystem includes cooling capabilities optimized for Meta's AI GPU clusters.
Storage is based on Meta's Tectonic filesystem, which consolidates multitenant filesystem instances for exabyte-scale distributed data workloads. Other storage deployments include high-capacity E1.S SSD systems based on the YV3 Sierra Point server platform.
Certainly AI is on its way to changing a lot about how we work and use the internet today. However, it’s always important to understand the resources — power, money, limited natural resources — that go into running any AI.