Summary
Foundation models across many domains are growing rapidly, and their performance improves with continued scaling. However, training these Large Language Models (LLMs) demands not only significant compute resources but also a robust and dependable system to keep the training process effective end to end.
Algorithm engineers face numerous challenges when training real-world LLMs, including server crashes, hardware failures, software incompatibilities, network communication errors, and unexplained hangs. Such failures destroy in-progress training output and force repeated restarts, wasting time and resources. For instance, merely launching the training process for a 175B-parameter model in a distributed environment can take several hours; repeated across many restarts, this overhead occupies a substantial fraction of the total training time, a cost many research groups find financially burdensome.
Therefore, establishing a robust and dependable platform that supports the entire lifecycle of LLM development is both technically challenging and urgently needed.
This project aims to explore and develop a resilient deep learning framework, together with its scientific foundations, to improve the LLM development lifecycle, with a specific focus on failover. The system is designed to tolerate the crash or failure of any worker without halting overall execution. An automatic failover process, transparent to upper-level users, efficiently restarts failed workers and re-initializes them from soft or hard state. Given the novelty of this research, students are encouraged and supported to publish ground-breaking papers at top-tier conferences and even to explore technical patents for potential startups.
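To make the failover idea concrete, the following is a minimal sketch of a supervisor loop that runs training steps, periodically checkpoints durable "hard" state, and on a worker failure rolls back to the last checkpoint while re-initializing "soft" state before resuming. All names here (`run_with_failover`, `WorkerFailure`, the state layout) are illustrative assumptions for this sketch, not the project's actual API.

```python
class WorkerFailure(Exception):
    """Simulated crash of a training worker (hypothetical for this sketch)."""


def run_with_failover(train_step, total_steps, checkpoint_every=5, max_restarts=10):
    """Run train_step until total_steps, surviving worker failures.

    Hard state (step counter, weights) is snapshotted every
    `checkpoint_every` steps; on failure we roll back to the snapshot.
    Soft state (RNG, comm groups, caches) would be rebuilt on restart.
    """
    hard_state = {"step": 0, "weights": 0.0}   # durable training state
    checkpoint = dict(hard_state)              # last persisted snapshot
    restarts = 0
    while hard_state["step"] < total_steps:
        try:
            train_step(hard_state)             # may raise WorkerFailure
            hard_state["step"] += 1
            if hard_state["step"] % checkpoint_every == 0:
                checkpoint = dict(hard_state)  # flush hard state
        except WorkerFailure:
            restarts += 1
            if restarts > max_restarts:
                raise                          # give up after too many crashes
            hard_state = dict(checkpoint)      # roll back to last hard state
            # soft state (RNG seeds, NCCL groups, caches) rebuilt here
    return hard_state, restarts


def flaky_step(state):
    """A training step that crashes exactly once, at step 7."""
    if state["step"] == 7 and not flaky_step.crashed:
        flaky_step.crashed = True
        raise WorkerFailure()
    state["weights"] += 0.1


flaky_step.crashed = False
```

With `run_with_failover(flaky_step, 10)`, the single crash at step 7 triggers one rollback to the checkpoint taken at step 5, and training still completes all 10 steps; the caller never observes the failure, mirroring the transparency goal described above.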
