process 0 terminated with signal SIGKILL #47

hsb1995 · 2024-04-09T00:40:47Z

I am interested in your project; a lot of work has clearly gone into it.
But I ran into this bug while running it. Please help! @jph00 @johnowhitaker @KeremTurgutlu @warner-benjamin @geronimi73

World size: 2
Downloading readme: 100%|██████████| 11.6k/11.6k [00:00<00:00, 4.21MB/s]

Downloading data: 0%| | 0.00/44.3M [00:00<?, ?B/s]
Downloading data: 24%|██▎ | 10.5M/44.3M [01:17<04:11, 135kB/s]
Downloading data: 24%|██▎ | 10.5M/44.3M [01:30<04:11, 135kB/s]
Downloading data: 47%|████▋ | 21.0M/44.3M [02:27<02:42, 144kB/s]
Downloading data: 47%|████▋ | 21.0M/44.3M [02:40<02:42, 144kB/s]
Downloading data: 71%|███████ | 31.5M/44.3M [03:37<01:27, 147kB/s]
Downloading data: 71%|███████ | 31.5M/44.3M [03:50<01:27, 147kB/s]
Downloading data: 95%|█████████▍| 41.9M/44.3M [04:32<00:14, 161kB/s]
Downloading data: 95%|█████████▍| 41.9M/44.3M [04:50<00:14, 161kB/s]
Downloading data: 100%|██████████| 44.3M/44.3M [05:12<00:00, 142kB/s]
Generating train split: 51760 examples [00:00, 76513.36 examples/s]
Creating model 0
Loading model 0
Loading & Quantizing Model Shards: 100%|██████████| 15/15 [30:58<00:00, 123.93s/it]
Rank 0: Model created: 1.479 GiB
trainable params: 744,488,960 || all params: 69,721,137,152 || trainable%: 1.0678095487411938
Wrapping model w/ FSDP 0
Rank 0: Wrapped model: 5.822 GiB
Applying activation checkpointing 0
Total Training Steps: 12940
Epoch 0, Loss 0.000: 0%| | 0/12940 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/sam/Doctorproject/fsdp_qlora/train.py", line 969, in
    def main(
  File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/fastcore/script.py", line 125, in call_parse
    return _f()
  File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/fastcore/script.py", line 119, in _f
    return tfunc(**merge(args, args_from_prog(func, xtra)))
  File "/home/sam/Doctorproject/fsdp_qlora/train.py", line 1042, in main
    mp.spawn(fsdp_main,
  File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL

Process finished with exit code 1
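
For readers of the traceback above: the exception is raised by the parent process, not by the worker. The worker was killed from outside, so it left no Python traceback of its own. The sketch below is not code from this repository; it is a minimal, POSIX-only illustration of how a parent sees a child that received SIGKILL. The helper names `run_child_killed_by_sigkill` and `describe_exit` are hypothetical. On Linux, one common source of an unexpected SIGKILL is the kernel's out-of-memory killer, but nothing in the log confirms that this is the cause here.

```python
import signal
import subprocess
import sys


def run_child_killed_by_sigkill() -> int:
    """Start a child Python process that sends SIGKILL to itself.

    Returns the exit code that the parent observes (POSIX only).
    """
    child_code = "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"
    result = subprocess.run([sys.executable, "-c", child_code])
    return result.returncode


def describe_exit(returncode: int) -> str:
    """Turn an exit code into a readable message.

    On POSIX, a negative return code -N means the child was
    terminated by signal N.
    """
    if returncode < 0:
        return f"terminated with signal {signal.Signals(-returncode).name}"
    return f"exited with code {returncode}"


if __name__ == "__main__":
    print(describe_exit(run_child_killed_by_sigkill()))
```

If the out-of-memory killer is suspected, the kernel log (for example via `dmesg`) is the usual place to look for a matching entry around the time of the crash.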

hsb1995 · 2024-04-09T00:57:49Z

Package Version

accelerate 0.29.1
aiohttp 3.9.3
aiosignal 1.3.1
appdirs 1.4.4
asttokens 2.4.1
async-timeout 4.0.3
attrs 23.2.0
bitsandbytes 0.43.0
black 24.3.0
Brotli 1.1.0
certifi 2022.12.7
charset-normalizer 2.1.1
click 8.1.7
coloredlogs 15.0.1
datasets 2.18.0
decorator 5.1.1
dill 0.3.8
docker-pycreds 0.4.0
exceptiongroup 1.2.0
executing 2.0.1
fastcore 1.5.29
filelock 3.9.0
fire 0.6.0
frozenlist 1.4.1
fsspec 2024.2.0
gitdb 4.0.11
GitPython 3.1.43
hqq 0.1.6.post2
hqq-aten 0.0.0
huggingface-hub 0.22.2
humanfriendly 10.0
idna 3.4
inflate64 1.0.0
ipython 8.23.0
jedi 0.19.1
Jinja2 3.1.2
llama-recipes 0.0.1
loralib 0.1.2
MarkupSafe 2.1.3
matplotlib-inline 0.1.6
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.16
multivolumefile 0.2.3
mypy-extensions 1.0.0
networkx 3.2.1
numpy 1.26.3
nvidia-cublas-cu11 11.11.3.6
nvidia-cuda-cupti-cu11 11.8.87
nvidia-cuda-nvrtc-cu11 11.8.89
nvidia-cuda-runtime-cu11 11.8.89
nvidia-cudnn-cu11 8.7.0.84
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.3.0.86
nvidia-cusolver-cu11 11.4.1.48
nvidia-cusparse-cu11 11.7.5.86
nvidia-nccl-cu11 2.19.3
nvidia-nvtx-cu11 11.8.86
optimum 1.18.0
packaging 24.0
pandas 2.2.1
parso 0.8.4
pathspec 0.12.1
peft 0.10.0
pexpect 4.9.0
pillow 10.2.0
pip 23.3.1
platformdirs 4.2.0
prompt-toolkit 3.0.43
protobuf 4.25.3
psutil 5.9.8
ptyprocess 0.7.0
pure-eval 0.2.2
py7zr 0.21.0
pyarrow 15.0.2
pyarrow-hotfix 0.6
pybcj 1.0.2
pycryptodomex 3.20.0
Pygments 2.17.2
pyppmd 1.1.0
python-dateutil 2.9.0.post0
pytz 2024.1
PyYAML 6.0.1
pyzstd 0.15.10
regex 2023.12.25
requests 2.28.1
safetensors 0.4.2
scipy 1.13.0
sentencepiece 0.2.0
sentry-sdk 1.44.1
setproctitle 1.3.3
setuptools 68.2.2
six 1.16.0
smmap 5.0.1
stack-data 0.6.3
sympy 1.12
termcolor 2.4.0
texttable 1.7.0
timm 0.9.16
tokenize-rt 5.2.0
tokenizers 0.15.2
tomli 2.0.1
torch 2.2.0+cu118
torchaudio 2.2.0+cu118
torchvision 0.17.0+cu118
tqdm 4.66.2
traitlets 5.14.2
transformers 4.39.3
triton 2.2.0
typing_extensions 4.8.0
tzdata 2024.1
urllib3 1.26.13
wandb 0.16.6
wcwidth 0.2.13
wheel 0.41.2
xxhash 3.4.1
yarl 1.9.4

hsb1995 · 2024-04-09T01:10:46Z

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 3630 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 4142333 C .../sam/anaconda3/envs/fsdp/bin/python 3990MiB |
| 1 N/A N/A 3630 G /usr/lib/xorg/Xorg 243MiB |
| 1 N/A N/A 3758 G /usr/bin/gnome-shell 9MiB |
| 1 N/A N/A 3249081 G /usr/libexec/gnome-shell-portal-helper 4MiB |
| 1 N/A N/A 4142334 C .../sam/anaconda3/envs/fsdp/bin/python 3968MiB |
+---------------------------------------------------------------------------------------+
I can confirm that the "Loading & Quantizing Model Shards" step does run in parallel on both GPUs. But once loading finishes, the message "process 0 terminated with signal SIGKILL" appears.

hsb1995 · 2024-04-09T03:19:09Z

Training works with a small model, but with a large model the problem occurs.
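
To put the small-versus-large difference in numbers, here is a rough back-of-the-envelope sketch. It assumes the parameter counts printed in the log above and estimates only the raw weight storage at a few precisions. It ignores quantization metadata, activations, optimizer state, and any extra copies held in CPU RAM while the shards are loaded, so the real footprint will differ. The function name `weight_gib` is hypothetical.

```python
GIB = 1024 ** 3  # bytes per GiB

# Parameter counts copied from the training log in this issue.
ALL_PARAMS = 69_721_137_152
TRAINABLE_PARAMS = 744_488_960


def weight_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate storage for num_params weights at the given bit width, in GiB."""
    return num_params * bits_per_param / 8 / GIB


if __name__ == "__main__":
    # Base weights at several precisions.
    for label, bits in [("4-bit", 4), ("bf16", 16), ("fp32", 32)]:
        print(f"all params @ {label}: {weight_gib(ALL_PARAMS, bits):.2f} GiB")

    # Trainable (LoRA) weights, assuming they are kept in bf16.
    print(f"trainable params @ bf16: {weight_gib(TRAINABLE_PARAMS, 16):.2f} GiB")
```

Comparing these figures with the machine's free system RAM (for example from `free -h`) and the 24 GB on each 3090 gives a first indication of whether memory pressure is a plausible explanation.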

hsb1995 · 2024-04-09T07:46:40Z

My code runs on dual 3090s. Could the authors please take a look?
