@@ -78,49 +78,47 @@ We can launch it with the following:

.. code-block:: console

+  $ git clone https://github.com/huggingface/transformers.git ~/transformers -b v4.30.1
   $ sky jobs launch -n bert-qa bert_qa.yaml
-

.. code-block:: yaml

  # bert_qa.yaml
  name: bert-qa

  resources:
    accelerators: V100:1
-   # Use spot instances to save cost.
-   use_spot: true
-
- # Assume your working directory is under `~/transformers`.
- # To make this example work, please run the following command:
- # git clone https://github.com/huggingface/transformers.git ~/transformers -b v4.30.1
- workdir: ~/transformers
+   use_spot: true  # Use spot instances to save cost.

- setup: |
+ envs:
    # Fill in your wandb key: copy from https://wandb.ai/authorize
    # Alternatively, you can use `--env WANDB_API_KEY=$WANDB_API_KEY`
    # to pass the key in the command line, during `sky jobs launch`.
-   echo export WANDB_API_KEY=[YOUR-WANDB-API-KEY] >> ~/.bashrc
+   WANDB_API_KEY:
+
+ # Assume your working directory is under `~/transformers`.
+ workdir: ~/transformers

+ setup: |
    pip install -e .
    cd examples/pytorch/question-answering/
    pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
    pip install wandb

  run: |
-   cd ./examples/pytorch/question-answering/
+   cd examples/pytorch/question-answering/
    python run_qa.py \
-     --model_name_or_path bert-base-uncased \
-     --dataset_name squad \
-     --do_train \
-     --do_eval \
-     --per_device_train_batch_size 12 \
-     --learning_rate 3e-5 \
-     --num_train_epochs 50 \
-     --max_seq_length 384 \
-     --doc_stride 128 \
-     --report_to wandb
-
+     --model_name_or_path bert-base-uncased \
+     --dataset_name squad \
+     --do_train \
+     --do_eval \
+     --per_device_train_batch_size 12 \
+     --learning_rate 3e-5 \
+     --num_train_epochs 50 \
+     --max_seq_length 384 \
+     --doc_stride 128 \
+     --report_to wandb \
+     --output_dir /tmp/bert_qa/
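The ``envs`` section above intentionally leaves ``WANDB_API_KEY`` empty; as the comments in the YAML note, the key can instead be supplied at launch time with ``--env``. A minimal usage sketch, assuming the key is already exported in your local shell:

.. code-block:: console

   $ sky jobs launch -n bert-qa bert_qa.yaml --env WANDB_API_KEY=$WANDB_API_KEY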

.. note::
@@ -162,55 +160,52 @@ An End-to-End Example

Below we show an `example <https://github.com/skypilot-org/skypilot/blob/master/examples/spot/bert_qa.yaml>`_ for fine-tuning a BERT model on a question-answering task with HuggingFace.

.. code-block:: yaml
-  :emphasize-lines: 13-16,42-45
+  :emphasize-lines: 8-11,41-44

  # bert_qa.yaml
  name: bert-qa

  resources:
    accelerators: V100:1
-   use_spot: true
-
- # Assume your working directory is under `~/transformers`.
- # To make this example work, please run the following command:
- # git clone https://github.com/huggingface/transformers.git ~/transformers -b v4.30.1
- workdir: ~/transformers
+   use_spot: true  # Use spot instances to save cost.

  file_mounts:
    /checkpoint:
      name: # NOTE: Fill in your bucket name
      mode: MOUNT

- setup: |
+ envs:
    # Fill in your wandb key: copy from https://wandb.ai/authorize
    # Alternatively, you can use `--env WANDB_API_KEY=$WANDB_API_KEY`
    # to pass the key in the command line, during `sky jobs launch`.
-   echo export WANDB_API_KEY=[YOUR-WANDB-API-KEY] >> ~/.bashrc
+   WANDB_API_KEY:
+
+ # Assume your working directory is under `~/transformers`.
+ workdir: ~/transformers

+ setup: |
    pip install -e .
    cd examples/pytorch/question-answering/
-   pip install -r requirements.txt
+   pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
    pip install wandb

  run: |
-   cd ./examples/pytorch/question-answering/
+   cd examples/pytorch/question-answering/
    python run_qa.py \
-     --model_name_or_path bert-base-uncased \
-     --dataset_name squad \
-     --do_train \
-     --do_eval \
-     --per_device_train_batch_size 12 \
-     --learning_rate 3e-5 \
-     --num_train_epochs 50 \
-     --max_seq_length 384 \
-     --doc_stride 128 \
-     --report_to wandb \
-     --run_name $SKYPILOT_TASK_ID \
-     --output_dir /checkpoint/bert_qa/ \
-     --save_total_limit 10 \
-     --save_steps 1000
-
-
+     --model_name_or_path bert-base-uncased \
+     --dataset_name squad \
+     --do_train \
+     --do_eval \
+     --per_device_train_batch_size 12 \
+     --learning_rate 3e-5 \
+     --num_train_epochs 50 \
+     --max_seq_length 384 \
+     --doc_stride 128 \
+     --report_to wandb \
+     --output_dir /checkpoint/bert_qa/ \
+     --run_name $SKYPILOT_TASK_ID \
+     --save_total_limit 10 \
+     --save_steps 1000
As HuggingFace has built-in support for periodic checkpointing, we only need to pass the highlighted arguments to set up the output directory and the frequency of checkpointing (see more