China Telecom Completes Pilot Commercial Deployment of LLM Distributed Training over 500 km

Staff Writer; Category: Data Centres & Networks; 17 July 2025

China Telecom has announced the industry's first pilot commercial deployment of distributed training for a large language model (LLM) with 177 billion (177B) parameters, using 1024 GPUs over a 500 km field-deployed fiber link.

The deployment offers a new approach to the coordinated development of AI infrastructure across regions.

The core challenge of this pilot lies in achieving collaborative computing over long-haul connections that must still deliver high bandwidth and low latency. Using its self-developed "wide-area lossless intelligent interconnection network" technology, China Telecom established a 500 km optical-path loopback network between Wuqing in Tianjin and Yinghai in Beijing. The results show that, across different LLMs and collective communication modes, training spread over multiple intelligent computing centers in different regions reaches 97% to 99% of the performance of training in a single data center.
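For a rough sense of the latency that distance alone adds, the following back-of-envelope sketch estimates propagation delay over 500 km of fiber. The refractive index of about 1.468 is a typical figure for silica fiber and is an assumption, not a value from the announcement. The sketch also treats 500 km as the one-way path length, and the function name is illustrative only.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre


def fiber_propagation_delay_ms(length_km: float, refractive_index: float = 1.468) -> float:
    """One-way propagation delay in milliseconds over optical fiber.

    refractive_index=1.468 is an assumed, typical value for silica fiber;
    it is not taken from the China Telecom announcement.
    """
    speed_in_fiber_m_per_s = SPEED_OF_LIGHT_M_PER_S / refractive_index
    delay_s = (length_km * 1_000) / speed_in_fiber_m_per_s
    return delay_s * 1_000


# Assumption: 500 km is the one-way path length.
one_way_ms = fiber_propagation_delay_ms(500)
round_trip_ms = 2 * one_way_ms

print(f"one-way:    {one_way_ms:.2f} ms")
print(f"round trip: {round_trip_ms:.2f} ms")
```

Under these assumptions the fiber contributes roughly 2.4 ms each way, before any switching or protocol overhead. This is orders of magnitude more than links inside a single data center, which is why the near-parity training performance is notable.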

The key technical highlight of this field trial is the use of 800G wide-area lossless transport technology with a bandwidth convergence ratio of 32:1 (computing-side bandwidth of 102.4 Tbit/s against transmission link bandwidth of 3.2 Tbit/s), together with pipeline parallel (PP) and data parallel (DP) training strategies. This combination effectively eliminates the packet loss caused by network congestion during long-haul transmission. In addition, a telecom-grade sub-50 ms Wavelength Switched Optical Network (WSON), enabled by innovations in protocol processing, ultra-fast wavelength selective switch (WSS) switching, and DSP reconstruction, provides seamless link failover within 50 ms, maintaining training continuity and system stability.
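The 32:1 convergence ratio follows directly from the two bandwidth figures quoted above, as the short calculation below shows. The other two derived numbers rest on assumptions not stated in the announcement: that the 3.2 Tbit/s link is built from 800 Gbit/s channels, and that the computing-side bandwidth is spread evenly across the 1024 GPUs.

```python
# Figures quoted in the announcement, expressed in Gbit/s to keep integer arithmetic.
COMPUTE_SIDE_GBPS = 102_400   # 102.4 Tbit/s computing-side bandwidth
LINK_GBPS = 3_200             # 3.2 Tbit/s transmission link bandwidth
GPU_COUNT = 1024

# Convergence ratio: how much computing-side bandwidth shares each unit of link bandwidth.
convergence_ratio = COMPUTE_SIDE_GBPS / LINK_GBPS

# Assumption (not stated in the source): the link is made up of 800 Gbit/s channels.
ASSUMED_CHANNEL_GBPS = 800
assumed_channel_count = LINK_GBPS // ASSUMED_CHANNEL_GBPS

# Assumption (not stated in the source): bandwidth is spread evenly across all GPUs.
assumed_per_gpu_gbps = COMPUTE_SIDE_GBPS / GPU_COUNT

print(f"convergence ratio: {convergence_ratio:.0f}:1")
print(f"800G channels (assumed): {assumed_channel_count}")
print(f"per-GPU bandwidth (assumed even split): {assumed_per_gpu_gbps:.0f} Gbit/s")
```

A ratio this high means the long-haul link carries only a small fraction of what the GPUs can exchange locally. This is consistent with using parallelism strategies that keep most traffic inside each site.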

Building on this, the field trial also uses the “Xirang” integrated intelligent computing platform, which supports cross-regional computing-network coordination, automated model parallelism, and checkpoint-based resumable training. The platform localizes faults within seconds and recovers within minutes, significantly improving the efficiency of commercial model deployment.
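The announcement does not describe how the platform implements checkpoint-based resumable training. As a general illustration of the idea only, here is a toy sketch in which a training loop periodically saves its state and a restarted run picks up from the last saved step. All names (`save_checkpoint`, `load_checkpoint`, `train`) and the JSON file format are hypothetical and have no connection to the Xirang platform.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    # Write to a temporary file, then rename it into place, so an
    # interruption mid-write cannot leave a half-written checkpoint.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    # Start from scratch when no checkpoint exists yet.
    if not os.path.exists(path):
        return {"step": 0, "weight": 0.0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, checkpoint_every=10, fail_at_step=None):
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if fail_at_step is not None and state["step"] == fail_at_step:
            raise RuntimeError("simulated link or node failure")
        state["weight"] += 0.5  # stand-in for a real optimizer update
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "ckpt.json")
    try:
        train(ckpt, total_steps=100, fail_at_step=37)
    except RuntimeError:
        pass  # the run was interrupted; progress since the last checkpoint is lost
    resumed_from = load_checkpoint(ckpt)["step"]
    final = train(ckpt, total_steps=100)

print(f"resumed from step {resumed_from}, finished at step {final['step']}")
```

In this toy run the failure at step 37 costs only the seven steps since the last checkpoint at step 30. The checkpoint interval is the trade-off: shorter intervals lose less work on failure but spend more time writing state.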

As the computing power required for LLM training grows exponentially, the traditional single data center faces bottlenecks from physical space, energy costs, and geographical constraints. China Telecom's achievement demonstrates the transformative value of cross-regional computing integration, effectively turning distributed data centers into a "virtual supercomputer". This approach significantly reduces training costs and offers a feasible technical path for national initiatives such as "East Data, West Computing".

The success of this pilot represents a major step forward in the innovation of China Telecom's intelligent computing networks, demonstrating its commitment to supporting the national strategy of coordinated compute-network development. Looking ahead, China Telecom will continue to invest in the research and development of intelligent networks, providing strong network support for AI advancement and contributing to the high-quality growth of China's digital economy.