Merge pull request #34 from bofenghuang/next
Update to version 2.2
bofenghuang authored Oct 20, 2023
2 parents 76e1cd0 + 80bfbcc commit 72b96b5
Showing 75 changed files with 6,531 additions and 71,140 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -147,6 +147,7 @@ Our project builds upon the following open-source projects for further development:
- [Alpaca-LoRA by @tloen](https://github.com/tloen/alpaca-lora)
- [Baize](https://github.com/project-baize/baize-chatbot)
- [llama.cpp by @ggerganov](https://github.com/ggerganov/llama.cpp)
- [Axolotl by @OpenAccess-AI-Collective](https://github.com/OpenAccess-AI-Collective/axolotl)

## Citation

51,708 changes: 0 additions & 51,708 deletions data/chat/converted_alpaca_data_cleaned_fr_52k.jsonl

This file was deleted.

15,003 changes: 0 additions & 15,003 deletions data/chat/converted_dolly_bactrian_fr_15k.jsonl

This file was deleted.

20 changes: 20 additions & 0 deletions data/chat/dummy_chat.jsonl

Large diffs are not rendered by default.

512 changes: 256 additions & 256 deletions data/chat/oasst_20230412_fr_top1.jsonl

Large diffs are not rendered by default.

1,648 changes: 0 additions & 1,648 deletions data/chat/sg_fr.jsonl

This file was deleted.

1,385 changes: 1,385 additions & 0 deletions data/chat/sharegpt_fr.jsonl

Large diffs are not rendered by default.

12 changes: 6 additions & 6 deletions docs/data.md
@@ -72,7 +72,7 @@ You can use the following script to generate the instruction-following data:
export OPENAI_API_KEY=YOUR/OPENAI/API/TOKEN

# num_instructions_to_generate is by worker
- python scripts/data_generation/generate_instructions.py \
+ python scripts/data_generation/generate_self_instruct.py \
--seed_tasks_path data/instruct/seed_tasks_vigogne.jsonl \
--prompt_path data/instruct/prompt_vigogne.txt \
--output_file data/instruct/self_instruct_data.jsonl \
@@ -116,9 +116,9 @@ Below is an example of a script we used to provide some translated subjects in […]
# Specify your OpenAI API key
export OPENAI_API_KEY=YOUR/OPENAI/API/TOKEN

- python scripts/data_generation/generate_conversations.py \
- --input_json_file data/chat/subject_quora_fr_nllb3b3.jsonl \
- --output_json_file data/chat/self_chat_data_quora_fr.jsonl \
+ python scripts/data_generation/generate_self_chat.py \
+ --input_file data/chat/subject_quora_fr_nllb3b3.jsonl \
+ --output_file data/chat/self_chat_data_quora_fr.jsonl \
--subject_field translated_subject \
--output_subject_field subject \
--id_prefix self-chat-quora- \
@@ -169,8 +169,8 @@ Next, you can generate responses using the following script. Please note that […]
export OPENAI_API_KEY=YOUR/OPENAI/API/TOKEN

python scripts/data_generation/generate_responses.py \
- --input_json_file path/to/flanv2_translated.jsonl \
- --output_json_file path/to/flanv2_translated_completed.jsonl \
+ --input_file path/to/flanv2_translated.jsonl \
+ --output_file path/to/flanv2_translated_completed.jsonl \
--system_field system_prompt \
--instruction_field translated_question \
--response_field fr_response \
8 changes: 5 additions & 3 deletions docs/model.md
@@ -10,9 +10,11 @@ You can access the weights for these models on the 🤗 Hugging Face Hub. For further […]

Here is a list of recommended models for this project. These models have been trained using more diverse and higher-quality data, along with an optimized training process. It is advisable to use these models as a priority for your project. For alternative models, please refer to the [Other Models](#other-models) section.

- | Model | Type | Foundation model | Data | Description |
- | :----------------------------------------------------------------------------: | :---: | :-------------------------------------: | :------------: | :----------------------------------------------------------------------------------------------------------------------------: |
- | [Vigogne-2-7B-Chat-V2.0](https://huggingface.co/bofenghuang/vigogne-2-7b-chat) | Chat | [Llama-2-7B](https://ai.meta.com/llama) | 520K chat data | Check out our [blog](https://github.com/bofenghuang/vigogne/blob/main/blogs/2023-08-17-vigogne-chat-v2_0.md) for more details. |
+ | Model | Type | Foundation model | Data | Description |
+ | :----------------------------------------------------------------------------: | :---: | :--------------------------------------------------------------: | :------------: | :----------------------------------------------------------------------------------------------------------------------------: |
+ | [Vigostral-7B-Chat](https://huggingface.co/bofenghuang/vigostral-7b-chat) | Chat | [Mistral-7B-v0.1](https://mistral.ai/news/announcing-mistral-7b) | | |
+ | [Vigogne-2-7B-Chat-V2.0](https://huggingface.co/bofenghuang/vigogne-2-7b-chat) | Chat | [Llama-2-7B](https://ai.meta.com/llama) | 520K chat data | Check out our [blog](https://github.com/bofenghuang/vigogne/blob/main/blogs/2023-08-17-vigogne-chat-v2_0.md) for more details. |
+ | [Vigogne-2-13B-Chat](https://huggingface.co/bofenghuang/vigogne-2-13b-chat) | Chat | [Llama-2-13B](https://ai.meta.com/llama) | 520K chat data | Check out our [blog](https://github.com/bofenghuang/vigogne/blob/main/blogs/2023-08-17-vigogne-chat-v2_0.md) for more details. |

### Legacy Models

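As a point of reference, loading one of the chat models listed in the table above with 🤗 Transformers might look like the following minimal sketch. It is not part of this commit: the model IDs come from the table, while the dtype, device placement, chat-template usage, and generation settings are illustrative assumptions.

```python
# Minimal sketch: loading one of the chat models from the table above with 🤗 Transformers.
# Assumes a recent transformers version with chat-template support and enough GPU memory
# for fp16 weights; generation parameters are illustrative, not the project's defaults.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bofenghuang/vigostral-7b-chat"  # or "bofenghuang/vigogne-2-7b-chat"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Build a chat prompt; apply_chat_template assumes the model repo ships a chat template.
messages = [{"role": "user", "content": "Parle-moi de toi-même."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```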
39 changes: 1 addition & 38 deletions docs/training.md
@@ -10,41 +10,4 @@ We highly recommend the utilization of tools such as [DeepSpeed](https://github.com/microsoft/DeepSpeed) […]

More examples can be found in [examples](https://github.com/bofenghuang/vigogne/blob/main/examples/train).

The following command shows how to fine-tune the Llama 2 7B model on a single GPU using LoRA and LLM.int8().

```bash
python vigogne/train/train_sft.py \
--model_name_or_path "meta-llama/Llama-2-7b-hf" \
--train_file "/path/to/train/instruct/file.jsonl" \
--output_dir "outputs/llama-2-7b-sft-instruct-lora-int8" \
--overwrite_output_dir \
--mode "instruct" \
--preprocessing_num_workers "8" \
--dataloader_num_workers "1" \
--pack_into_block \
--block_size "2048" \
--load_in_8bit \
--lora_r "64" \
--lora_alpha "16" \
--lora_dropout "0.05" \
--target_modules "q_proj" "v_proj" "k_proj" "o_proj" "gate_proj" "up_proj" "down_proj" \
--per_device_train_batch_size "8" \
--per_device_eval_batch_size "4" \
--num_train_epochs "3" \
--learning_rate "1e-4" \
--warmup_ratio "0.03" \
--lr_scheduler_type "cosine" \
--weight_decay "0" \
--torch_compile \
--fp16 \
--gradient_checkpointing \
--ddp_find_unused_parameters false \
--log_level "info" \
--logging_steps "10" \
--logging_first_step true \
--save_strategy "steps" \
--save_steps "100" \
--save_total_limit "3" \
--report_to "tensorboard" "wandb" \
--do_train
```
Since version 2.2, I've refactored the training code, integrating specific elements inspired by the excellent training framework [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl). Thanks to the Axolotl team for their contributions to the open-source community! The primary motivation behind maintaining my own framework is to have full control over the entire training process and customize it to my specific needs. I highly recommend using Axolotl for additional features.
Empty file modified examples/inference/run_fastchat_cli.sh
100644 → 100755
Empty file.
Empty file modified examples/inference/run_gradio_demo.sh
100644 → 100755
Empty file.
Empty file modified examples/inference/run_llama_cpp.sh
100644 → 100755
Empty file.
3 changes: 3 additions & 0 deletions examples/inference/run_server_vllm.sh
100644 → 100755
@@ -16,3 +16,6 @@ export CUDA_VISIBLE_DEVICES="0"
python -m vllm.entrypoints.openai.api_server \
--model bofenghuang/vigogne-2-7b-chat \
--host "0.0.0.0"

# Then send request using the following script
# python vigogne/inference/vllm/client_openai_chatcompletion.py
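As a rough illustration of the client step referenced in the comment above, a chat-completion request to the OpenAI-compatible endpoint exposed by this server might look like the sketch below. The port (vLLM's default 8000), payload fields, and prompt are assumptions, not taken from the repository's client script.

```python
# Minimal sketch: querying the vLLM OpenAI-compatible server started above.
# Assumes the server listens on the default port 8000; the payload follows the
# standard OpenAI chat-completions schema.
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "bofenghuang/vigogne-2-7b-chat",
        "messages": [{"role": "user", "content": "Parle-moi de toi-même."}],
        "max_tokens": 256,
        "temperature": 0.7,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```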
76 changes: 76 additions & 0 deletions examples/train/falcon/train_sft_chat_qlora.sh
@@ -0,0 +1,76 @@
#!/usr/bin/env bash
# Copyright 2023 Bofeng Huang

# Train chat models using QLoRA (int4)

export WANDB_PROJECT="llm-sft-chat"
export OMP_NUM_THREADS="1"
export TOKENIZERS_PARALLELISM="false"
export BITSANDBYTES_NOWELCOME="1"
# export CUDA_VISIBLE_DEVICES="0"

# Model
model_name_or_path=tiiuae/falcon-7b
# model_name_or_path=tiiuae/falcon-40b

# Dataset
# Customize dataset here
train_file=data/chat/oasst_20230412_fr_top1.jsonl
model_max_length=2048

# Outdir
run_name=falcon-7b-sft-chat-qlora
output_dir=outputs/$run_name

# Might need to adjust the batch size and other hyperparameters by yourself
per_device_train_batch_size=8
per_device_eval_batch_size=4
gradient_accumulation_steps=8

# Further optimization
# DeepSpeed Stage 2
# --deepspeed vigogne/configs/ds_config_zero2_no_offload.json \

torchrun \
vigogne/cli/train_sft.py \
--model_name_or_path $model_name_or_path \
--tokenizer_use_fast false \
--tokenizer_padding_side "right" \
--add_special_tokens '{"bos_token":">>ABSTRACT<<","pad_token":"<|endoftext|>"}' \
--train_file $train_file \
--output_dir $output_dir \
--overwrite_output_dir \
--run_name $run_name \
--processor_style "vigogne_chat_v3" \
--model_max_length $model_max_length \
--eval_split_ratio "0.01" \
--preprocessing_num_workers "8" \
--dataloader_num_workers "1" \
--adapter "qlora" \
--load_in_4bit \
--optim "paged_adamw_32bit" \
--lora_r "64" \
--lora_alpha "16" \
--lora_dropout "0.05" \
--lora_target_all_linear_layers \
--do_merge_lora \
--num_train_epochs "3" \
--per_device_train_batch_size $per_device_train_batch_size \
--per_device_eval_batch_size $per_device_eval_batch_size \
--gradient_accumulation_steps $gradient_accumulation_steps \
--learning_rate "1e-4" \
--warmup_ratio "0.03" \
--lr_scheduler_type "cosine" \
--weight_decay "0" \
--fp16 \
--gradient_checkpointing \
--ddp_find_unused_parameters false \
--log_level "info" \
--logging_steps "1" \
--logging_first_step \
--save_strategy "steps" \
--save_steps "100" \
--save_total_limit "3" \
--evaluation_strategy "steps" \
--eval_steps "100" \
--report_to "tensorboard" "wandb"
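For readers unfamiliar with the QLoRA-related flags in the script above (--load_in_4bit, --adapter "qlora", --lora_r, --lora_alpha, --lora_dropout, --lora_target_all_linear_layers), the following is a rough sketch of an equivalent setup in plain transformers + peft + bitsandbytes. It illustrates the technique only, not the project's actual training code; the explicit target_modules list and quantization details are assumptions.

```python
# Rough sketch of a QLoRA setup with transformers + peft + bitsandbytes.
# Not the project's implementation; target_modules and quantization settings are
# illustrative choices for a Falcon-style model.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization, roughly what --load_in_4bit enables.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # re-enable gradients where needed for k-bit training

# LoRA adapter, mirroring --lora_r / --lora_alpha / --lora_dropout above;
# listing Falcon's linear layers explicitly stands in for --lora_target_all_linear_layers.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"],
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```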
@@ -1,58 +1,65 @@
#!/usr/bin/env bash
# Copyright 2023 Bofeng Huang

export WANDB_PROJECT="llm-sft-chat-fr"
# Train chat models using full fine-tuning + DeepSpeed Stage 3

export WANDB_PROJECT="llm-sft-chat"
export OMP_NUM_THREADS="1"
export TOKENIZERS_PARALLELISM="false"
export BITSANDBYTES_NOWELCOME="1"
export CUDA_VISIBLE_DEVICES="0,1,2,3"

train_file=/path/to/train/chat/file.jsonl
# Model
model_name_or_path=meta-llama/Llama-2-7b-hf

mode=chat
# Dataset
# Customize dataset here
train_file=data/chat/oasst_20230412_fr_top1.jsonl
model_max_length=2048

model_name_or_path=meta-llama/Llama-2-7b-hf
output_dir=outputs/llama-2-7b-sft-chat-lora-int8
# Outdir
run_name=llama-2-7b-sft-chat-fullfinetune
output_dir=outputs/$run_name

per_device_train_batch_size=8
# Might need to adjust the batch size and other hyperparameters by yourself
per_device_train_batch_size=4
per_device_eval_batch_size=2
gradient_accumulation_steps=4

# Might need to adjust the batch size and other hyperparameters by yourself
torchrun \
--nproc_per_node 4 \
vigogne/train/train_sft.py \
vigogne/cli/train_sft.py \
--deepspeed vigogne/configs/ds_config_zero3_no_offload.json \
--model_name_or_path $model_name_or_path \
--tokenizer_use_fast false \
--tokenizer_padding_side "right" \
--train_file $train_file \
--output_dir $output_dir \
--overwrite_output_dir \
--mode $mode \
--run_name $run_name \
--processor_style "vigogne_chat_v3" \
--model_max_length $model_max_length \
--eval_split_ratio "0.01" \
--preprocessing_num_workers "8" \
--dataloader_num_workers "1" \
--pack_into_block \
--block_size "2048" \
--load_in_8bit \
--lora_r "64" \
--lora_alpha "16" \
--lora_dropout "0.05" \
--target_modules "q_proj" "v_proj" "k_proj" "o_proj" "gate_proj" "up_proj" "down_proj" \
--num_train_epochs "3" \
--per_device_train_batch_size $per_device_train_batch_size \
--per_device_eval_batch_size $per_device_eval_batch_size \
--gradient_accumulation_steps $gradient_accumulation_steps \
--num_train_epochs "3" \
--learning_rate "1e-4" \
--optim "adamw_bnb_8bit" \
--learning_rate "2.5e-5" \
--warmup_ratio "0.03" \
--lr_scheduler_type "cosine" \
--weight_decay "0" \
--torch_compile \
--fp16 \
--gradient_checkpointing \
--ddp_find_unused_parameters false \
--log_level "info" \
--logging_steps "10" \
--logging_first_step true \
--logging_steps "1" \
--logging_first_step \
--save_strategy "steps" \
--save_steps "100" \
--save_total_limit "3" \
--report_to "tensorboard" "wandb" \
--do_train
--evaluation_strategy "steps" \
--eval_steps "100" \
--report_to "tensorboard" "wandb"
@@ -1,58 +1,76 @@
#!/usr/bin/env bash
# Copyright 2023 Bofeng Huang

export WANDB_PROJECT="llm-sft-instruct-fr"
# Train chat models using LoRA

export WANDB_PROJECT="llm-sft-chat"
export OMP_NUM_THREADS="1"
export TOKENIZERS_PARALLELISM="false"
export BITSANDBYTES_NOWELCOME="1"
export CUDA_VISIBLE_DEVICES="0,1,2,3"
# export CUDA_VISIBLE_DEVICES="0"

train_file=/path/to/train/instruct/file.jsonl
# Model
model_name_or_path=meta-llama/Llama-2-7b-hf

mode=instruct
# Dataset
# Customize dataset here
train_file=data/chat/oasst_20230412_fr_top1.jsonl
model_max_length=2048

model_name_or_path=meta-llama/Llama-2-7b-hf
output_dir=outputs/llama-2-7b-sft-instruct-lora-int8
# Outdir
run_name=llama-2-7b-sft-chat-lora
output_dir=outputs/$run_name

# Might need to adjust the batch size and other hyperparameters by yourself
per_device_train_batch_size=8
gradient_accumulation_steps=4
per_device_eval_batch_size=4
gradient_accumulation_steps=8

# Further optimization
# DeepSpeed Stage 2
# --deepspeed vigogne/configs/ds_config_zero2_no_offload.json \
# LLM.int8()
# --load_in_8bit \
# 8bit optimizer
# --optim "adamw_bnb_8bit" \

# Might need to adjust the batch size and other hyperparameters by yourself
torchrun \
--nproc_per_node 4 \
vigogne/train/train_sft.py \
vigogne/cli/train_sft.py \
--model_name_or_path $model_name_or_path \
--tokenizer_use_fast false \
--tokenizer_padding_side "right" \
--train_file $train_file \
--output_dir $output_dir \
--overwrite_output_dir \
--mode $mode \
--run_name $run_name \
--processor_style "vigogne_chat_v3" \
--model_max_length $model_max_length \
--eval_split_ratio "0.01" \
--preprocessing_num_workers "8" \
--dataloader_num_workers "1" \
--pack_into_block \
--block_size "2048" \
--load_in_8bit \
--adapter "lora" \
--lora_r "64" \
--lora_alpha "16" \
--lora_dropout "0.05" \
--target_modules "q_proj" "v_proj" "k_proj" "o_proj" "gate_proj" "up_proj" "down_proj" \
--lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" "gate_proj" "up_proj" "down_proj" \
--do_merge_lora \
--num_train_epochs "3" \
--per_device_train_batch_size $per_device_train_batch_size \
--per_device_eval_batch_size $per_device_eval_batch_size \
--gradient_accumulation_steps $gradient_accumulation_steps \
--num_train_epochs "3" \
--learning_rate "1e-4" \
--warmup_ratio "0.03" \
--lr_scheduler_type "cosine" \
--weight_decay "0" \
--torch_compile \
--fp16 \
--gradient_checkpointing \
--ddp_find_unused_parameters false \
--log_level "info" \
--logging_steps "10" \
--logging_first_step true \
--logging_steps "1" \
--logging_first_step \
--save_strategy "steps" \
--save_steps "100" \
--save_total_limit "3" \
--report_to "tensorboard" "wandb" \
--do_train
--evaluation_strategy "steps" \
--eval_steps "100" \
--report_to "tensorboard" "wandb"