- [2025/01/23] ⚡️ We released Sky-T1-32B-Flash (model, data) to tackle overthinking and reduce reasoning sequence lengths while maintaining accuracy.
- [2025/01/19] 🎉 Chat demo for Sky-T1-32B-Preview is live! Please check it out!
- [2025/01/10] 🎉 We have released our Sky-T1-32B-Preview model and data through HuggingFace!
We open-source the code and scripts we used for data curation, training, and evaluation of Sky-T1-32B-Preview. You can find more details in each directory.
- `recipes`: Recipes (data curation steps and training strategies) for building our models Sky-T1-32B-Flash and Sky-T1-32B-Preview.
- `skythought/skythought_evals`: Our data generation and evaluation library.
- `skythought/train`: Training scripts for Sky-T1. We use Llama-Factory to perform training. The model was trained for 3 epochs with a learning rate of 1e-5 and a batch size of 96. Training completed in 19 hours on 8 H100 GPUs using DeepSpeed ZeRO-3 offload, costing approximately $450 at Lambda Cloud pricing (see the back-of-the-envelope sketch below).
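As a rough sanity check on the figures above, here is a back-of-the-envelope sketch. Only the global batch size (96), GPU count (8), wall-clock time (19 hours), and the ~$450 total come from this README; the per-GPU micro-batch split and the hourly GPU rate are illustrative assumptions.

```python
# Back-of-the-envelope check of the training setup described above.
# Only the totals (global batch 96, 8 GPUs, 19 hours, ~$450) come from this
# README; the micro-batch split and hourly rate are illustrative assumptions.

num_gpus = 8                  # H100 GPUs used for training
per_device_batch = 1          # assumption: micro-batch size per GPU
grad_accum_steps = 12         # assumption: gradient-accumulation steps
effective_batch = num_gpus * per_device_batch * grad_accum_steps
assert effective_batch == 96  # matches the reported global batch size

wall_clock_hours = 19
usd_per_gpu_hour = 2.99       # assumption: approximate Lambda on-demand H100 rate
estimated_cost = num_gpus * wall_clock_hours * usd_per_gpu_hour
print(f"effective batch size: {effective_batch}")
print(f"estimated cost: ${estimated_cost:.0f}")  # ~$454, in line with the reported ~$450
```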
First, clone the repository and install the package:
git clone https://github.com/NovaSky-AI/SkyThought.git
cd SkyThought
# installs shown for conda
conda create -n eval python=3.10
conda activate eval
pip install -e .
We support a wide variety of datasets in mathematics, science, and coding:
- AIME'24
- MATH500
- GPQADiamond
- MMLU
- ARC-Challenge
- OlympiadBench
- AMC'23
- TACO
- APPS
- LiveCodeBench
- MMLU Pro
- MinervaMath
- GSM8K
For running evaluation, please refer to skythought_evals/README.md.
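Before running the full evaluation suite, the released checkpoint can also be queried directly with HuggingFace Transformers as a quick sanity check. The sketch below is illustrative and is not part of our evaluation pipeline; the prompt and generation settings are assumptions, and a 32B model in bfloat16 needs roughly 70 GB or more of GPU memory.

```python
# Illustrative sketch (not the evaluation harness): load the released
# Sky-T1-32B-Preview checkpoint and answer one benchmark-style question.
# The prompt and generation settings below are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NovaSky-AI/Sky-T1-32B-Preview"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user", "content": "Find the sum of all positive integers less than 100 that are divisible by 7."}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=2048, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```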
Below, we show the evaluation results for the Sky-T1-32B-Preview model across math, coding, and science benchmarks.
Metric | Sky-T1-32B-Preview | Qwen-2.5-32B-Instruct | QwQ | o1-preview |
---|---|---|---|---|
Math500 | 86.4 | 81.4 | 92.2 | 81.4 |
AIME2024 | 43.3 | 16.7 | 50.0 | 40.0 |
LiveCodeBench-Easy | 86.3 | 84.6 | 90.7 | 92.9 |
LiveCodeBench-Medium | 56.8 | 40.8 | 56.3 | 54.9 |
LiveCodeBench-Hard | 17.9 | 9.8 | 17.1 | 16.3 |
GPQA-Diamond | 56.8 | 45.5 | 52.5 | 75.2 |
OlympiadBench (Math, EN) | 59.79 | 46.74 | 62.17 | 59.2 |
We also evaluate on non-reasoning benchmarks (instruction following, QA, etc.) to test whether the model has traded off capability in other domains for better performance on reasoning-related benchmarks.
Metric | Sky-T1-32B-Preview | Qwen-2.5-32B-Instruct | QwQ-32B-Preview | Eval Implementation |
---|---|---|---|---|
MMLU (0 shot; no CoT) | 78.36 | 74.14 | 71.23 | lm_eval |
MMLU (5 shot; no CoT) | 82.46 | 82.62 | 82.32 | lm_eval |
ARC-C (0 shot; no CoT) | 49.49 | 49.4 | 49.66 | lm_eval |
IFEval | 75.79 | 78.74 | 42.51 | lm_eval |
LLM-as-a-Judge | 9.12 | 9.19 | 8.30 | fastchat |
MGSM (0 shot; direct) | 33 | 42.3 | 19.07 | lm_eval |
MGSM (8-shot; direct) | 58.4 | 61.47 | 58.5 | lm_eval |
BFCL-v3 | 53.18 | 58.92 | 17.41 | BFCL |
Arena-Hard | 74.79 | 66.51 | 52.6 | Arena-Hard-Auto |
For more details, refer here.
We believe that open-source collaboration drives progress, and with Sky-T1-32B-Preview, we are fully committed to empowering the community. We open-source all details (i.e., data, code, model weights) so that the community can easily replicate and improve on our results:
Model | Sky-T1-32B-Preview | STILL-2 | Journey | QwQ | o1 |
---|---|---|---|---|---|
Data | ✅ | ✅ | ❌ | ❌ | ❌ |
Code | ✅ | ❌ | ❌ | ❌ | ❌ |
Report | ✅ | ✅ | ✅ | ❌ | ❌ |
Math domain | ✅ | ✅ | ✅ | ✅ | ✅ |
Coding domain | ✅ | ❌ | ❌ | ✅ | ✅ |
Model Weights | ✅ | ✅ | ❌ | ✅ | ❌ |
Much of the code in this repository is described in the blog post cited below. Please consider citing this work if you find the repository helpful.
@misc{sky_t1_2025,
author = {NovaSky Team},
title = {Sky-T1: Train your own O1 preview model within \$450},
howpublished = {https://novasky-ai.github.io/posts/sky-t1},
note = {Accessed: 2025-01-09},
year = {2025}
}
This work was done at the Berkeley Sky Computing Lab, with amazing compute support from Lambda Labs, Anyscale, and Databricks. We would like to express our gratitude for the valuable academic feedback and support from the STILL-2 team and Junyang Lin from the Qwen team.