May 6, 2026
Supercomputer networking to accelerate large scale AI training
Frontier model training depends on reliable supercomputer networks that can move data quickly between GPUs. To make GPU-to-GPU communication faster and more efficient, OpenAI has partnered with AMD, Broadcom, Intel, Microsoft, and NVIDIA to develop MRC (Multipath Reliable Connection): a novel protocol that improves GPU networking performance and resilience in large training clusters. We released MRC today through the Open Compute Project (OCP) so the broader industry can use it.

TL;DR
- OpenAI partnered with AMD, Broadcom, Intel, Microsoft, and NVIDIA to create MRC (Multipath Reliable Connection).
- MRC is a novel protocol designed to improve GPU networking performance and resilience in large training clusters for AI models.
- The protocol enables the creation of multi-plane high-speed networks with redundancy to withstand failures, using fewer components and less power.
- MRC's adaptive packet spraying virtually eliminates core network congestion by distributing traffic across many paths.
- Static source routing is used to bypass failures and simplify network control planes, enhancing reliability.
- MRC is already deployed in OpenAI's large NVIDIA GB200 supercomputers and has been used to train multiple OpenAI models.
- The MRC specification is now available through the Open Compute Project (OCP) for the broader industry.
- Key benefits include building high-speed networks of more than 100,000 GPUs with only two tiers of switches, eliminating core network congestion, and quickly bypassing failures.
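To make the two ideas above concrete, here is a toy sketch of how adaptive packet spraying and static source routing fit together. The `PathSprayer` class, the route tuples, and the two-tier leaf/spine names are all hypothetical illustrations, not the MRC wire format or API: the sender holds a set of precomputed source routes, sprays packets across all healthy ones, and simply skips any route marked failed, with no control-plane reconvergence.

```python
from collections import Counter
from itertools import cycle

class PathSprayer:
    """Toy sketch (not the MRC spec) of packet spraying over source routes.

    Each 'path' is a precomputed source route: a tuple of switch hops the
    sender stamps onto the packet. Spraying rotates packets across all
    healthy paths; failed paths are skipped immediately by the sender.
    """

    def __init__(self, paths):
        self.paths = list(paths)
        self.failed = set()

    def mark_failed(self, path):
        # Static source routing: the sender just stops using this route.
        self.failed.add(path)

    def healthy_paths(self):
        return [p for p in self.paths if p not in self.failed]

    def spray(self, num_packets):
        """Assign each packet a source route, rotating over healthy paths."""
        healthy = self.healthy_paths()
        if not healthy:
            raise RuntimeError("no healthy paths available")
        rr = cycle(healthy)
        return [next(rr) for _ in range(num_packets)]

# Four hypothetical routes in a two-tier (leaf/spine) fabric.
paths = [("leaf0", "spine0"), ("leaf0", "spine1"),
         ("leaf1", "spine0"), ("leaf1", "spine1")]
sprayer = PathSprayer(paths)

# With all paths healthy, 8 packets spread evenly: 2 per route.
load = Counter(sprayer.spray(8))
assert all(count == 2 for count in load.values())

# After a spine failure, traffic bypasses the affected routes at once.
sprayer.mark_failed(("leaf0", "spine1"))
sprayer.mark_failed(("leaf1", "spine1"))
load = Counter(sprayer.spray(8))
assert ("leaf0", "spine1") not in load
assert all(count == 4 for count in load.values())
```

Spreading each flow's packets across many routes is what keeps any single core link from becoming a hotspot; because routes are chosen at the source, bypassing a failure is a local sender-side decision rather than a network-wide routing update.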