May 6, 2026
Supercomputer networking to accelerate large scale AI training
Frontier model training depends on reliable supercomputer networks that can move data quickly between GPUs. To make GPU-to-GPU communication faster and more efficient, OpenAI has partnered with AMD, Broadcom, Intel, Microsoft, and NVIDIA to develop MRC (Multipath Reliable Connection): a novel protocol that improves GPU networking performance and resilience in large training clusters. We released MRC today through the Open Compute Project (OCP) so the broader industry can use it.

TL;DR
- OpenAI partnered with AMD, Broadcom, Intel, Microsoft, and NVIDIA to create MRC (Multipath Reliable Connection).
- MRC is a novel protocol designed to improve GPU networking performance and resilience in large training clusters for AI models.
- The protocol enables the creation of multi-plane high-speed networks with redundancy to withstand failures, using fewer components and less power.
- MRC's adaptive packet spraying virtually eliminates core network congestion by distributing traffic across many paths.
- Static source routing is used to bypass failures and simplify network control planes, enhancing reliability.
- MRC is already deployed in OpenAI's large NVIDIA GB200 supercomputers and has been used to train multiple OpenAI models.
- The MRC specification is now available through the Open Compute Project (OCP) for the broader industry.
- Key benefits include building high-speed networks of more than 100,000 GPUs with only two tiers of switches, eliminating core network congestion, and quickly bypassing failures.
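To make the two ideas above concrete, here is a toy sketch of how adaptive packet spraying and static source routing fit together. The `PathSprayer` class, the route tuples, and the two-tier leaf/spine names are all hypothetical illustrations, not the MRC wire format or API: the sender holds a set of precomputed source routes, sprays packets across all healthy ones, and simply skips any route marked failed, with no control-plane reconvergence.

```python
from collections import Counter
from itertools import cycle

class PathSprayer:
    """Toy sketch (not the MRC spec) of packet spraying over source routes.

    Each 'path' is a precomputed source route: a tuple of switch hops the
    sender stamps onto the packet. Spraying rotates packets across all
    healthy paths; failed paths are skipped immediately by the sender.
    """

    def __init__(self, paths):
        self.paths = list(paths)
        self.failed = set()

    def mark_failed(self, path):
        # Static source routing: the sender just stops using this route.
        self.failed.add(path)

    def healthy_paths(self):
        return [p for p in self.paths if p not in self.failed]

    def spray(self, num_packets):
        """Assign each packet a source route, rotating over healthy paths."""
        healthy = self.healthy_paths()
        if not healthy:
            raise RuntimeError("no healthy paths available")
        rr = cycle(healthy)
        return [next(rr) for _ in range(num_packets)]

# Four hypothetical routes in a two-tier (leaf/spine) fabric.
paths = [("leaf0", "spine0"), ("leaf0", "spine1"),
         ("leaf1", "spine0"), ("leaf1", "spine1")]
sprayer = PathSprayer(paths)

# With all paths healthy, 8 packets spread evenly: 2 per route.
load = Counter(sprayer.spray(8))
assert all(count == 2 for count in load.values())

# After a spine failure, traffic bypasses the affected routes at once.
sprayer.mark_failed(("leaf0", "spine1"))
sprayer.mark_failed(("leaf1", "spine1"))
load = Counter(sprayer.spray(8))
assert ("leaf0", "spine1") not in load
assert all(count == 4 for count in load.values())
```

Spreading each flow's packets across many routes is what keeps any single core link from becoming a hotspot; because routes are chosen at the source, bypassing a failure is a local sender-side decision rather than a network-wide routing update.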