-
Notifications
You must be signed in to change notification settings - Fork 167
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] feat: Add multi-turn SFT support #195
Draft
xingyaoww
wants to merge
10
commits into
volcengine:main
Choose a base branch
from
xingyaoww:feature/multi-turn-sft-dataset
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
[WIP] feat: Add multi-turn SFT support #195
xingyaoww
wants to merge
10
commits into
volcengine:main
from
xingyaoww:feature/multi-turn-sft-dataset
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
- Add MultiTurnSFTDataset class for handling multi-turn conversations - Support different roles (system, user, assistant) with role-specific prefixes - Set loss mask to 1 for assistant responses only - Add comprehensive test suite for the new dataset class
- Replace custom chat formatting with HuggingFace chat template - Use Qwen tokenizer for testing - Fix tensor indexing and loss mask generation - Update test to verify proper tokenization
- Use HuggingFace chat template instead of custom formatting - Add comprehensive tests for loss mask behavior - Verify both assistant and non-assistant content - Add debug output for test failures
- Add separate workflow for unit tests - Run tests in tests/soft directory - Generate and upload coverage reports - Use same container as e2e tests
- Move tests from tests/soft to tests/sft/unit for consistency - Update CI workflow paths - Keep all SFT-related tests under tests/sft
- Update trainer to support both single-turn and multi-turn datasets - Add example script for multi-turn training - Add data preprocessing script for multi-turn conversations - Use proper chat template for multi-turn data
- Add use_multiturn flag (default: false) - Add messages_key for multi-turn mode (default: messages) - Group single-turn and multi-turn settings
- Add OpenHands SFT dataset preprocessing script - Add token length limit (32k) for conversations - Move multi-turn example to tests/sft - Add train/test split and statistics
xingyaoww
changed the title
feat: Add multi-turn SFT support
[WIP] feat: Add multi-turn SFT support
Feb 4, 2025
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Multi-turn Conversation Fine-tuning Support
Overview
This PR adds support for fine-tuning models on multi-turn conversations, including proper chat template handling and loss masking for assistant responses. The implementation includes support for the OpenHands SFT dataset and handles conversations up to 32k tokens.
Key Features
MultiTurnSFTDataset
class for handling multi-turn conversationsapply_chat_template
Implementation Details
Dataset:
Training:
use_multiturn
flag in configmessages_key
for multi-turn data formatExamples and Tests:
Usage Example
##Testing
Documentation
ok.. this is another PR mostly done by OpenHands with me messaging it about 10 times..
It is still WIP as I'm testing it on a training job now, will report back if it works