New Protocol Addresses Bottlenecks in Large-Scale AI Training

A consortium of major technology companies (OpenAI, AMD, Broadcom, Intel, Microsoft, and Nvidia) has jointly developed a new networking protocol called Multipath Reliable Connection (MRC) to address growing challenges with data transmission speeds for artificial intelligence workloads.

The MRC protocol is designed to distribute traffic across multiple network paths simultaneously rather than concentrating it on a limited number of connections. This approach helps prevent the bottlenecks that occur when training extremely large AI models, which requires moving massive datasets, particularly in training runs that use 100,000 or more GPUs.
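To make the contrast concrete, here is a minimal Python sketch of the two approaches. It is not MRC's actual algorithm, which the announcement does not detail: the function names, the path labels, and the round-robin spraying policy are all assumptions chosen for illustration. The first function pins a whole flow to one path by hashing its identifier, and the second spreads the packets of a flow across every healthy path and skips any path marked as failed.

```python
import zlib
from collections import Counter


def single_path(flow_id: str, paths: list[str]) -> str:
    """Pin an entire flow to one path by hashing its ID (hash-based, ECMP-style)."""
    return paths[zlib.crc32(flow_id.encode()) % len(paths)]


def spray(num_packets: int, paths: list[str], failed=frozenset()) -> Counter:
    """Spread packets round-robin over all healthy paths (hypothetical policy).

    Returns how many packets each path carried. Paths listed in `failed`
    are skipped, which mimics routing around a broken link.
    """
    healthy = [p for p in paths if p not in failed]
    if not healthy:
        raise RuntimeError("no healthy path available")
    counts = Counter()
    for seq in range(num_packets):
        counts[healthy[seq % len(healthy)]] += 1
    return counts


if __name__ == "__main__":
    paths = ["p0", "p1", "p2", "p3"]  # hypothetical path labels
    print("pinned flow uses only:", single_path("gpu7->gpu42", paths))
    print("sprayed, all healthy: ", dict(spray(8, paths)))
    print("sprayed, p1 failed:   ", dict(spray(8, paths, failed={"p1"})))
```

In this toy model the pinned flow can never use more than one path's bandwidth, while the sprayed flow uses all of them and keeps running on the remaining paths when one fails. A real transport would also have to handle packets arriving out of order, which this sketch ignores.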

According to OpenAI’s official announcement, “Network congestion and link/device failures are common causes of data transmission delays and jitter.” The company explained that as cluster sizes increase, these issues become more frequent and more difficult to resolve. A single point of failure can halt a training run and force a restart from a saved checkpoint, and network recalculations can sometimes take several seconds.

Key Benefits of the MRC Protocol

  • Improved Resilience: The protocol’s multi-path design allows it to automatically reroute traffic around failures, maintaining uninterrupted operation even when individual links or devices experience issues.
  • Enhanced Efficiency: By distributing load across available paths, MRC maximizes GPU utilization and reduces network congestion.
  • Simplified Management: A single dashboard provides administrators with comprehensive visibility and control over network traffic patterns.
  • Scalability: The architecture can connect more than 100,000 GPUs using only two layers of Ethernet switches, a significant reduction from the three to four switch layers typically required for 800 Gb/s networks.
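
The scalability figure above can be sanity-checked with the standard capacity formula for a non-blocking folded-Clos (fat-tree) network, in which a fabric built from switches with k ports and L layers supports at most 2 × (k/2)^L endpoints. The sketch below is illustrative arithmetic only. The port counts of 64 and 512 are assumptions picked for the example, not figures from the announcement, and a real deployment may be oversubscribed or arranged differently.

```python
def max_hosts(radix: int, tiers: int) -> int:
    """Maximum endpoints in a non-blocking fat-tree built from `radix`-port switches."""
    if radix < 2 or radix % 2 != 0 or tiers < 1:
        raise ValueError("radix must be an even number >= 2 and tiers >= 1")
    return 2 * (radix // 2) ** tiers


def tiers_needed(radix: int, hosts: int) -> int:
    """Smallest number of switch layers that can connect `hosts` endpoints."""
    tiers = 1
    while max_hosts(radix, tiers) < hosts:
        tiers += 1
    return tiers


if __name__ == "__main__":
    target = 100_000
    for radix in (64, 512):  # assumed port counts, for illustration only
        layers = tiers_needed(radix, target)
        print(f"{radix}-port switches: {layers} layers, "
              f"up to {max_hosts(radix, layers):,} endpoints")
```

Under these assumptions, 64-port switches top out at 65,536 endpoints with three layers and need a fourth to pass 100,000, while 512-port switches reach 131,072 endpoints with two layers. The number of ports each switch offers is therefore the main lever on how many layers a cluster of this size needs.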

The technology is already deployed in some of the world’s largest AI training clusters, including OpenAI’s facilities and Microsoft’s Fairwater data center, which supports advanced language models like ChatGPT and Codex. Nvidia has integrated its Spectrum-X Ethernet into the MRC framework, demonstrating real-world application of this next-generation networking solution.