Artificial Intelligence (AI) is the hottest topic of the day. It screams from headlines. Companies are scrambling to establish an AI position. Thanks to generative AI applications like ChatGPT, AI’s massive potential is increasingly understood and appreciated, from the arts to the sciences and beyond.
As countless predictions are made about AI’s value, a growing chorus of concern demands a cautious approach.
One thing is clear: we’re only seeing the tip of the AI iceberg. As it evolves, just like the iPhone before it, AI is poised to become exponentially more than anything that can be imagined today.
AI is already affecting the communications industry, with a massive effect on data center workloads that also serve edge computing and cloud-based 5G network traffic.
Large cloud service providers (CSPs) are seeing the earliest impact from massive AI workloads. Data center operators are right behind, already grappling with terabit networking thresholds to handle the projected AI demand for bandwidth and compute.
These terabit thresholds can’t be reached simply by adding more server racks or fiber runs. Data centers need to be rearchitected to meet the explosive growth of AI workloads.
In this blog, we recap insights from a joint Spirent and Dell’Oro webinar.
AI demands data center network transformation
AI models are growing in complexity by 1,000 times every three years, requiring low-latency, high-bandwidth connectivity between servers, storage, and xPUs (a device abstraction that can be mapped to CPUs, GPUs, FPGAs, and other accelerators) for AI training and inferencing (the generation of AI intelligence).
There’s no way around it: data centers and the high-speed networks they rely on must transform to efficiently and sustainably support AI’s rapid uptake.
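To put the growth figure quoted above in more familiar terms, the short sketch below converts "1,000 times every three years" into an equivalent yearly multiplier. It is illustrative arithmetic only: the helper names are hypothetical, and it assumes the growth rate stays constant, which the source does not claim.

```python
# Illustrative arithmetic only. The 1,000x-per-three-years figure comes from
# the text above; the helper names are hypothetical and the constant-rate
# assumption is ours.

def annual_factor(total_factor: float, period_years: float) -> float:
    """Constant yearly multiplier equivalent to total_factor over period_years."""
    return total_factor ** (1.0 / period_years)


def growth_over(total_factor: float, period_years: float, horizon_years: float) -> float:
    """Projected multiplier after horizon_years if the same rate holds."""
    return total_factor ** (horizon_years / period_years)


if __name__ == "__main__":
    yearly = annual_factor(1000.0, 3.0)
    six_year = growth_over(1000.0, 3.0, 6.0)
    print(f"Equivalent yearly growth: about {yearly:.1f}x")
    print(f"Growth over six years at the same rate: about {six_year:,.0f}x")
```

In other words, 1,000 times over three years works out to roughly a tenfold increase every year, which is why incremental capacity additions struggle to keep pace.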
The complexity and size of AI applications dictate the number of xPUs needed to run the apps, the amount of memory, and the type and scale of the network fabric needed to connect all the xPUs. With the scale of AI applications skyrocketing, requirements of thousands to tens of thousands of xPUs and trillions of dense parameters would not be surprising in the near future.
With that kind of scale, a data center can’t just keep adding racks. To handle large AI workloads, a separate, scalable, routable back-end network infrastructure is needed for xPU-to-xPU connectivity. AI apps have much less of an impact on the front-end Ethernet networks that provide data ingestion related to the training process of AI algorithms.
The requirements for this separate back-end network – which relates to AI inference – differ considerably from the traditional data center front-end access network. In addition to five times more traffic and increased network bandwidth per accelerator, the back-end network needs to support thousands of synchronized parallel jobs and data- and compute-intensive workloads.
Since the progression of all nodes can be held back by any delayed flow, network latency is a critical issue for AI workload performance. Latency is already a problem, even before the anticipated massive AI workloads arrive. According to Meta, on average, 33% of AI elapsed time is spent waiting for the network. Such delays incur timeouts that impact customer service, are costly, and impede scalability.
AI workloads are driving an unprecedented need for low-latency, high-bandwidth back-end network connectivity between servers, storage, and the xPUs that are essential for AI training and inferencing.
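The point that a single delayed flow holds back every node can be sketched with a very simple model. The example below assumes a synchronized job in which each step consists of a compute phase followed by a communication phase that finishes only when the slowest flow completes, with no overlap between the two. The function names and the timing values are hypothetical and are not drawn from Meta's data.

```python
# A minimal sketch of the "slowest flow gates everyone" effect in a
# synchronized job. Simplifying assumptions (ours, not the source's):
# compute and communication do not overlap, and all nodes wait for the
# slowest flow before starting the next step. All numbers are hypothetical.

def step_time(compute_s: float, flow_times_s: list[float]) -> float:
    """Duration of one synchronized step: compute plus the slowest flow."""
    return compute_s + max(flow_times_s)


def network_wait_fraction(compute_s: float, flow_times_s: list[float]) -> float:
    """Share of the step spent waiting on the network."""
    wait = max(flow_times_s)
    return wait / (compute_s + wait)


if __name__ == "__main__":
    compute = 2.0                      # seconds of xPU work per step
    healthy = [1.0, 1.0, 1.0, 1.0]     # every flow finishes in 1 s
    one_slow = [1.0, 1.0, 1.0, 3.0]    # a single flow is delayed to 3 s

    for label, flows in (("healthy", healthy), ("one slow flow", one_slow)):
        total = step_time(compute, flows)
        frac = network_wait_fraction(compute, flows)
        print(f"{label}: step = {total:.1f} s, network wait = {frac:.0%}")
```

In this toy model, three of the four flows are unchanged in the second case, yet the whole step stretches because every node waits on the one delayed flow. That is why tail latency, rather than average latency, tends to matter most for these workloads.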
Adoption of high-speed networking
Data center design is in the early stages of evolving to cater to AI workloads.
Dell’Oro Group provided a forecast for 2023-2027 that addresses questions many companies are asking about the timing and adoption rate of high-speed networks.
Front-end network ports are expected to remain Ethernet. Adoption of next-generation speeds will initially be driven by front-end connectivity to AI clusters for data ingestion. By 2027, Dell’Oro expects one third of all Ethernet ports in the front-end network to run at 800 Gbps or higher.
In contrast, back-end AI networks are projected to migrate quickly, with nearly all ports running at 800 Gbps and above by 2027 and triple-digit CAGR for bandwidth growth. Back-end networks will include both Ethernet and InfiniBand, which are expected to co-exist for the foreseeable future.
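For readers less familiar with the term, compound annual growth rate (CAGR) is the constant yearly rate that would take a starting value to an ending value over a given number of years. "Triple-digit" means 100% or more, that is, at least doubling every year. The sketch below shows the standard formula; the bandwidth values are hypothetical placeholders and are not Dell’Oro forecast data.

```python
# Standard CAGR formula: (end / start) ** (1 / years) - 1.
# The example values are hypothetical placeholders, NOT forecast data.

def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate as a fraction (1.0 means 100%)."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


def is_triple_digit(rate: float) -> bool:
    """True when the rate is at least 100% per year."""
    return rate >= 1.0


if __name__ == "__main__":
    # Hypothetical: relative bandwidth grows 16x over the four years 2023-2027.
    rate = cagr(start_value=1.0, end_value=16.0, years=4)
    print(f"CAGR: {rate:.0%}, triple-digit: {is_triple_digit(rate)}")
```

A 16-fold increase over four years corresponds to a 100% CAGR, the lower bound of the triple-digit range; anything faster than doubling each year lands above it.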
One size doesn’t fit all deployments
Approaches to deploying AI back-end networks in the data center already vary widely, with hyperscalers Google, Microsoft, and Amazon taking different paths. Deployments like AI training require a lossless back-end network such as InfiniBand. Other implementations prefer standardized, well-understood Ethernet, and some use both InfiniBand and Ethernet.
One solution doesn’t fit all needs, and convergence on a single path is not expected any time soon.
Factors and tradeoffs that influence modern data center architectures include:
Size of deployments and number of clusters
Complexity of applications and workloads
The relative importance of low and deterministic latency, as well as high bandwidth, for the AI applications
Bandwidth and load balancing application requirements, e.g., the number of lanes in an 800 Gbps channel
Whether compute- and time-intensive AI training will be outsourced or performed in-house
Standardized versus proprietary technologies and their anticipated evolutions to meet the needs of AI
Desire for a diversified, multivendor supply chain
The AI data center journey is just beginning and will change dramatically as AI evolves. Even the hyperscalers are still determining the best fabric for their AI workloads, both for today and for the near future, recognizing that data centers built today without proper planning may be obsolete in two years.
Validating high-speed, low latency AI networks
As AI technology innovations continue their rapid evolution, the networks they rely on must be validated and tested to ensure they meet the needs of growing AI workloads.
Spirent is a leading provider of test and assurance solutions for next-generation devices and networks.
Learn more about AI networking challenges and solutions in the webinar on the impact of AI workloads on modern data center networks.