AI infrastructure refers to the technology stack that runs AI workloads. Any AI technology stack consists of:
AI technologies are highly resource-intensive and typically rely on bespoke infrastructure, as organizations aim to maximize the compute efficiency, reliability and scalability of that infrastructure.
(Related reading: infrastructure security & Splunk Infrastructure Monitoring.)
Let’s review the key components of an artificial intelligence infrastructure. (New to IT? Start with this IT infrastructure beginner’s guide.)
The most interesting AI infrastructure component for AI developers is the specialized hardware technology that is used to train and run AI models. A GPU architecture contains:
HPC CPUs are more commonly used for standardized computing tasks that may be latency-sensitive, such as:
(CPUs vs. GPUs: when to use each.)
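To make that division of labor concrete, here is a minimal sketch (assuming PyTorch is installed) that places a parallel workload on a GPU when one is available and falls back to the CPU otherwise:

```python
import torch

# Prefer a GPU when one is available; fall back to the CPU otherwise.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A large matrix multiplication is the kind of parallel workload GPUs excel at.
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)
c = a @ b  # runs on the GPU if one was found, on the CPU otherwise

print(f"Ran on: {device}, result shape: {tuple(c.shape)}")
```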
AI model performance is highly dependent on the data used to train it. In fact, the success of LLMs such as ChatGPT largely comes down to their training data.
While data may be free and publicly available, it takes an efficient storage infrastructure and data platform to ingest, process, analyze and train AI models on large volumes of information at scale. The storage infrastructure consists of:
Key considerations for AI storage infrastructure include scalability (particularly with regard to storage cost), I/O performance, security and compliance.
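As an illustration of why storage layout matters, the sketch below streams training text shard by shard instead of loading an entire corpus into memory. The shard file names are hypothetical, and the example assumes PyTorch's data utilities are available:

```python
from torch.utils.data import IterableDataset, DataLoader

class ShardedTextDataset(IterableDataset):
    """Streams training examples shard by shard rather than loading
    the full corpus into memory at once (shard paths are hypothetical)."""

    def __init__(self, shard_paths):
        self.shard_paths = shard_paths

    def __iter__(self):
        for path in self.shard_paths:
            with open(path, "r", encoding="utf-8") as f:
                for line in f:
                    yield line.strip()

# Hypothetical shard files; in practice these might live in object storage.
dataset = ShardedTextDataset(["shard-000.txt", "shard-001.txt"])
loader = DataLoader(dataset, batch_size=32)

for batch in loader:
    pass  # tokenize and feed the batch to the model here
```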
AI workloads require high-performance network fabrics that can handle trillions of AI model executions and compute processes across distributed hardware clusters. The network must be able to load-balance elephant flows (very large, long-lived data transfers), especially when the network architecture follows hierarchical patterns for efficient data handling.
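To illustrate what that fabric actually carries, the sketch below shows distributed training processes synchronizing a tensor across the cluster with an all-reduce. It assumes PyTorch's torch.distributed with the NCCL backend, and a launcher such as torchrun setting the rank environment variables:

```python
import torch
import torch.distributed as dist

def main():
    # Rank, world size and master address are typically injected by the
    # launcher (e.g. torchrun) via environment variables.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # Each process holds a local tensor (a stand-in for a gradient shard).
    local = torch.ones(1024, device="cuda") * rank

    # all_reduce sums the tensors from every process over the network fabric.
    dist.all_reduce(local, op=dist.ReduceOp.SUM)

    if rank == 0:
        print("Reduced first element:", local[0].item())

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```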
The performance impact at the physical layer should be minimal: high I/O in real-time data stream processing can lead to packet loss. The network should:
The platform and software/application stack provides resources specific to AI development and model deployment.
ML frameworks such as PyTorch, GPU programming toolkits such as CUDA, and other model-specific frameworks speed up the AI development process. These software tools are typically provisioned as containerized systems that isolate AI development from the underlying hardware infrastructure.
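As a rough illustration of how these frameworks and GPU toolkits work together, the sketch below runs a single mixed-precision training step with PyTorch on CUDA hardware. The model and batch are stand-ins for illustration, not a recommended configuration:

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(1024, 10).to(device)          # stand-in for a real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=device.type == "cuda")
loss_fn = nn.CrossEntropyLoss()

# One training step with automatic mixed precision on CUDA hardware.
x = torch.randn(64, 1024, device=device)
y = torch.randint(0, 10, (64,), device=device)

optimizer.zero_grad()
with torch.cuda.amp.autocast(enabled=device.type == "cuda"):
    loss = loss_fn(model(x), y)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```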
Finally, MLOps is adopted to automate the management of:
Monitoring and optimization run at the infrastructure layer, using AI-driven monitoring and analytics tools that analyze traffic from a distributed AI infrastructure, including cloud-based and on-premises systems.
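One common source of such telemetry is the GPU itself. The sketch below, which assumes NVIDIA's management library is available through the pynvml bindings, samples per-device utilization and memory usage that a monitoring pipeline could ingest:

```python
from pynvml import (
    nvmlInit,
    nvmlShutdown,
    nvmlDeviceGetCount,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetUtilizationRates,
    nvmlDeviceGetMemoryInfo,
)

nvmlInit()
try:
    for i in range(nvmlDeviceGetCount()):
        handle = nvmlDeviceGetHandleByIndex(i)
        util = nvmlDeviceGetUtilizationRates(handle)   # GPU / memory busy %
        mem = nvmlDeviceGetMemoryInfo(handle)          # bytes used / total
        print(f"gpu{i} util={util.gpu}% mem={mem.used / mem.total:.0%}")
finally:
    nvmlShutdown()
```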
(Understand the layers: read about the OSI networking model.)
AI models are deployed in production environments either:
The infrastructure running these services is not part of the AI data and processing pipeline but is integrated via API calls to deliver a secondary downstream service.
For example, Meta uses its Llama 3 GPU clusters primarily for generative AI use cases. And as it expands its GPU cluster portfolio, secondary services — such as ads, search, recommender systems, and ranking algorithms — can take advantage of its genAI models.
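In practice, that integration is often just an HTTP call from the downstream service to a model-serving endpoint. A minimal sketch, with a hypothetical URL and payload:

```python
import requests

# Hypothetical inference endpoint exposed by the model-serving layer.
INFERENCE_URL = "https://models.example.com/v1/generate"

payload = {"prompt": "Summarize today's incident reports.", "max_tokens": 128}
response = requests.post(INFERENCE_URL, json=payload, timeout=30)
response.raise_for_status()

print(response.json())  # downstream services consume the generated output
```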
All of this requires an expansive data lake platform that can:
(Learn how Splunk AI accelerates detection, investigation and response.)
Now, let’s look at a specific example of an AI infrastructure.
Meta recently published details of the generative AI infrastructure it uses to train and run its latest LLMs, including Llama 3. The infrastructure includes two GPU clusters, each containing 24,576 flagship NVIDIA H100 GPUs. This is an upgrade from its previous AI research infrastructure, which contained 16,000 NVIDIA A100 GPUs.
The company plans to further extend its computing capacity by deploying 350,000 H100 GPUs by the end of 2024.
These clusters run on two different network fabric systems: one built with RDMA over Converged Ethernet (RoCE) and the other with NVIDIA Quantum2 InfiniBand.
Both fabrics offer 400Gbps endpoints. Meta uses its own Grand Teton GPU hardware platform, open sourced as part of its Open Compute Project (OCP) contributions. The platform is built on the Open Rack v3 (ORV3) power and rack infrastructure design, which has been widely adopted across the industry. The ORV3 ecosystem includes cooling capabilities optimized for Meta's AI GPU clusters.
Storage is based on Meta's Tectonic filesystem, which consolidates multitenant filesystem instances for exabyte-scale distributed data workloads. Other storage deployments include high-capacity E1.S SSD systems based on the YV3 Sierra Point server platform.
Certainly AI is on its way to changing a lot about how we work and use the internet today. However, it’s always important to understand the resources — power, money, limited natural resources — that go into running any AI.