process 0 terminated with signal SIGKILL #47
Package versions: accelerate 0.29.1
I am interested in your project; it clearly reflects a lot of work.
But I ran into this bug while running it. Please help! @jph00 @johnowhitaker @KeremTurgutlu @warner-benjamin @geronimi73
World size: 2
Downloading readme: 100%|██████████| 11.6k/11.6k [00:00<00:00, 4.21MB/s]
Downloading data: 0%| | 0.00/44.3M [00:00<?, ?B/s]
Downloading data: 24%|██▎ | 10.5M/44.3M [01:17<04:11, 135kB/s]
Downloading data: 24%|██▎ | 10.5M/44.3M [01:30<04:11, 135kB/s]
Downloading data: 47%|████▋ | 21.0M/44.3M [02:27<02:42, 144kB/s]
Downloading data: 47%|████▋ | 21.0M/44.3M [02:40<02:42, 144kB/s]
Downloading data: 71%|███████ | 31.5M/44.3M [03:37<01:27, 147kB/s]
Downloading data: 71%|███████ | 31.5M/44.3M [03:50<01:27, 147kB/s]
Downloading data: 95%|█████████▍| 41.9M/44.3M [04:32<00:14, 161kB/s]
Downloading data: 95%|█████████▍| 41.9M/44.3M [04:50<00:14, 161kB/s]
Downloading data: 100%|██████████| 44.3M/44.3M [05:12<00:00, 142kB/s]
Generating train split: 51760 examples [00:00, 76513.36 examples/s]
Creating model 0
Loading model 0
Loading & Quantizing Model Shards: 100%|██████████| 15/15 [30:58<00:00, 123.93s/it]
Rank 0: Model created: 1.479 GiB
trainable params: 744,488,960 || all params: 69,721,137,152 || trainable%: 1.0678095487411938
Wrapping model w/ FSDP 0
Rank 0: Wrapped model: 5.822 GiB
Applying activation checkpointing 0
Total Training Steps: 12940
Epoch 0, Loss 0.000: 0%| | 0/12940 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/sam/Doctorproject/fsdp_qlora/train.py", line 969, in <module>
    def main(
  File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/fastcore/script.py", line 125, in call_parse
    return _f()
  File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/fastcore/script.py", line 119, in _f
    return tfunc(**merge(args, args_from_prog(func, xtra)))
  File "/home/sam/Doctorproject/fsdp_qlora/train.py", line 1042, in main
    mp.spawn(fsdp_main,
  File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
Process finished with exit code 1
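For anyone reading the traceback: the exception comes from the parent process and only reports that worker 0 was killed by a signal, so there is no Python traceback from the worker itself. As a rough illustration (a standard-library sketch, not torch's actual implementation; `describe_exit` and `run_child_that_kills_itself` are hypothetical helper names), this is how a negative exit code maps to a signal name on POSIX. A SIGKILL delivered from outside the process is often the Linux out-of-memory killer running out of host RAM, but that is an assumption here; the log above does not confirm it.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Turn a child's return code into a short human-readable description."""
    if returncode < 0:
        # On POSIX, subprocess reports death-by-signal as the negated signal number.
        return f"terminated with signal {signal.Signals(-returncode).name}"
    return f"exited with code {returncode}"


def run_child_that_kills_itself() -> int:
    """Start a child Python process that sends SIGKILL to itself (POSIX only)."""
    child_code = "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"
    return subprocess.run([sys.executable, "-c", child_code]).returncode


if __name__ == "__main__":
    # The child cannot catch SIGKILL, so it leaves no traceback of its own;
    # the parent only sees the negative return code.
    print(describe_exit(run_child_that_kills_itself()))
```

Because the worker cannot catch SIGKILL, the kernel log and host memory usage during "Loading & Quantizing Model Shards" are probably more informative than the Python output.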