
CoreScheduler: A High-Performance Scheduling Framework for Large-Scale Model Training in C++

Overview

CoreScheduler is a fully featured C++ library for efficient, scalable training of large models. It focuses on managing asynchronous tasks and their dependencies in distributed environments. See the introduction for more details.

Notice: This project is under rapid development, so API stability is not guaranteed, particularly for the dataset APIs. Several resources, including the documentation, are still being completed. We appreciate your patience while we finalize these materials.

If you have any questions, please feel free to open an issue!

Key Features

  • Pure C++ Implementation: Provides fine-grained control over multi-threading and resource management.
  • Asynchronous Scheduling: Overlaps computation with communication to shorten training time.
  • Advanced Scheduling Capabilities: Overlaps independent GPU computations to significantly improve throughput (see the sketch after this list).
  • Communication-Computation Overlap: Keeps data transfers and computation in flight at the same time to reduce wait times.
  • Computation-Computation Overlap: Executes multiple independent computation tasks concurrently to make full use of the hardware.
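
To make the overlap idea concrete, here is a minimal, generic sketch using only standard C++ futures. It is not the CoreScheduler API; all_reduce and forward below are hypothetical placeholders invented for illustration. It only shows the pattern the scheduler automates: launching a communication-bound task asynchronously and running independent computation while it is in flight.

// Generic illustration only -- not the CoreScheduler API.
// all_reduce() and forward() are hypothetical placeholders.
#include <chrono>
#include <future>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

// Simulated gradient all-reduce (communication-bound).
std::vector<float> all_reduce(std::vector<float> grads) {
  std::this_thread::sleep_for(std::chrono::milliseconds(50));  // fake network latency
  return grads;
}

// Simulated forward pass of the next micro-batch (compute-bound).
float forward(const std::vector<float>& activations) {
  return std::accumulate(activations.begin(), activations.end(), 0.0f);
}

int main() {
  std::vector<float> grads(1 << 20, 1.0f);
  std::vector<float> acts(1 << 20, 0.5f);

  // Launch the communication task asynchronously ...
  auto reduced = std::async(std::launch::async, all_reduce, std::move(grads));
  // ... and run independent computation while it is in flight.
  float partial = forward(acts);

  // Synchronize only when the communication result is actually needed.
  auto synced = reduced.get();
  std::cout << "partial=" << partial << " synced grads=" << synced.size() << "\n";
  return 0;
}

CoreScheduler applies this principle at the level of GPU work and collective operations, tracking dependencies so that independent tasks are scheduled concurrently.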

Usage

Clone the repository and set up the environment:

git clone git@github.com:TheCoreTeam/core_scheduler.git
cd core_scheduler
conda env create -f env.yaml

Compile & Run Tests

conda activate core_scheduler
mkdir build
cd build
cmake ..
make -j core_scheduler_tests
./test/core_scheduler_tests

Examples

For detailed tutorials, see the examples in the example folder. Guidance on training the GPT-2 model can be found in the GPT-2 training tutorial.

Future Directions

  • Enhanced Distributed Strategies: Future versions will implement advanced strategies such as ZeRO and 3D parallelism to optimize resource allocation and maximize training efficiency across multiple nodes.
  • Distributed Fault Tolerance: Develop robust fault-tolerance mechanisms so that training progress and data integrity are preserved across distributed systems even after partial system failures.
  • More Advanced Models (e.g., Llama-3, MoE): Expand support to state-of-the-art models, including Llama-3 and Mixture of Experts (MoE), with the scalability and specialization they require.

Contributing

Contributions to CoreScheduler are welcome. Please visit our issues page for ways to get involved.

License

This project is licensed under the Apache-2.0 License - see the LICENSE file for details.

Authors

See AUTHORS.txt
