process 0 terminated with signal SIGKILL #47

Open
hsb1995 opened this issue Apr 9, 2024 · 4 comments

Comments

hsb1995 commented Apr 9, 2024

I am interested in your project; it clearly represents a lot of work.
However, I ran into this bug when running it. Please help! @jph00 @johnowhitaker @KeremTurgutlu @warner-benjamin @geronimi73

World size: 2
Downloading readme: 100%|██████████| 11.6k/11.6k [00:00<00:00, 4.21MB/s]

Downloading data: 0%| | 0.00/44.3M [00:00<?, ?B/s]
Downloading data: 24%|██▎ | 10.5M/44.3M [01:17<04:11, 135kB/s]
Downloading data: 24%|██▎ | 10.5M/44.3M [01:30<04:11, 135kB/s]
Downloading data: 47%|████▋ | 21.0M/44.3M [02:27<02:42, 144kB/s]
Downloading data: 47%|████▋ | 21.0M/44.3M [02:40<02:42, 144kB/s]
Downloading data: 71%|███████ | 31.5M/44.3M [03:37<01:27, 147kB/s]
Downloading data: 71%|███████ | 31.5M/44.3M [03:50<01:27, 147kB/s]
Downloading data: 95%|█████████▍| 41.9M/44.3M [04:32<00:14, 161kB/s]
Downloading data: 95%|█████████▍| 41.9M/44.3M [04:50<00:14, 161kB/s]
Downloading data: 100%|██████████| 44.3M/44.3M [05:12<00:00, 142kB/s]
Generating train split: 51760 examples [00:00, 76513.36 examples/s]
Creating model 0
Loading model 0
Loading & Quantizing Model Shards: 100%|██████████| 15/15 [30:58<00:00, 123.93s/it]
Rank 0: Model created: 1.479 GiB
trainable params: 744,488,960 || all params: 69,721,137,152 || trainable%: 1.0678095487411938
Wrapping model w/ FSDP 0
Rank 0: Wrapped model: 5.822 GiB
Applying activation checkpointing 0
Total Training Steps: 12940
Epoch 0, Loss 0.000: 0%| | 0/12940 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/sam/Doctorproject/fsdp_qlora/train.py", line 969, in <module>
    def main(
  File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/fastcore/script.py", line 125, in call_parse
    return _f()
  File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/fastcore/script.py", line 119, in _f
    return tfunc(**merge(args, args_from_prog(func, xtra)))
  File "/home/sam/Doctorproject/fsdp_qlora/train.py", line 1042, in main
    mp.spawn(fsdp_main,
  File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL

Process finished with exit code 1
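
For readers unfamiliar with this error: the traceback above comes from the parent process, which only sees that worker 0 was killed by a signal, not why. The following is a minimal sketch (not the project's or PyTorch's code) of how a child killed by SIGKILL looks from the parent on Linux. The helper `describe_exit` is a hypothetical name introduced for illustration.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Turn a child's exit status into a human-readable description."""
    if returncode < 0:
        # On POSIX, a negative return code -N means "killed by signal N".
        return f"terminated with signal {signal.Signals(-returncode).name}"
    return f"exited with code {returncode}"


# A child that kills itself with SIGKILL, standing in for a worker
# that was killed from outside (for example by the kernel).
child_code = "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"
result = subprocess.run([sys.executable, "-c", child_code])
print(describe_exit(result.returncode))
```

Because SIGKILL cannot be caught, the killed worker never gets a chance to print its own traceback, which is consistent with the log ending abruptly at step 0.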

hsb1995 commented Apr 9, 2024

Package Version


accelerate 0.29.1
aiohttp 3.9.3
aiosignal 1.3.1
appdirs 1.4.4
asttokens 2.4.1
async-timeout 4.0.3
attrs 23.2.0
bitsandbytes 0.43.0
black 24.3.0
Brotli 1.1.0
certifi 2022.12.7
charset-normalizer 2.1.1
click 8.1.7
coloredlogs 15.0.1
datasets 2.18.0
decorator 5.1.1
dill 0.3.8
docker-pycreds 0.4.0
exceptiongroup 1.2.0
executing 2.0.1
fastcore 1.5.29
filelock 3.9.0
fire 0.6.0
frozenlist 1.4.1
fsspec 2024.2.0
gitdb 4.0.11
GitPython 3.1.43
hqq 0.1.6.post2
hqq-aten 0.0.0
huggingface-hub 0.22.2
humanfriendly 10.0
idna 3.4
inflate64 1.0.0
ipython 8.23.0
jedi 0.19.1
Jinja2 3.1.2
llama-recipes 0.0.1
loralib 0.1.2
MarkupSafe 2.1.3
matplotlib-inline 0.1.6
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.16
multivolumefile 0.2.3
mypy-extensions 1.0.0
networkx 3.2.1
numpy 1.26.3
nvidia-cublas-cu11 11.11.3.6
nvidia-cuda-cupti-cu11 11.8.87
nvidia-cuda-nvrtc-cu11 11.8.89
nvidia-cuda-runtime-cu11 11.8.89
nvidia-cudnn-cu11 8.7.0.84
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.3.0.86
nvidia-cusolver-cu11 11.4.1.48
nvidia-cusparse-cu11 11.7.5.86
nvidia-nccl-cu11 2.19.3
nvidia-nvtx-cu11 11.8.86
optimum 1.18.0
packaging 24.0
pandas 2.2.1
parso 0.8.4
pathspec 0.12.1
peft 0.10.0
pexpect 4.9.0
pillow 10.2.0
pip 23.3.1
platformdirs 4.2.0
prompt-toolkit 3.0.43
protobuf 4.25.3
psutil 5.9.8
ptyprocess 0.7.0
pure-eval 0.2.2
py7zr 0.21.0
pyarrow 15.0.2
pyarrow-hotfix 0.6
pybcj 1.0.2
pycryptodomex 3.20.0
Pygments 2.17.2
pyppmd 1.1.0
python-dateutil 2.9.0.post0
pytz 2024.1
PyYAML 6.0.1
pyzstd 0.15.10
regex 2023.12.25
requests 2.28.1
safetensors 0.4.2
scipy 1.13.0
sentencepiece 0.2.0
sentry-sdk 1.44.1
setproctitle 1.3.3
setuptools 68.2.2
six 1.16.0
smmap 5.0.1
stack-data 0.6.3
sympy 1.12
termcolor 2.4.0
texttable 1.7.0
timm 0.9.16
tokenize-rt 5.2.0
tokenizers 0.15.2
tomli 2.0.1
torch 2.2.0+cu118
torchaudio 2.2.0+cu118
torchvision 0.17.0+cu118
tqdm 4.66.2
traitlets 5.14.2
transformers 4.39.3
triton 2.2.0
typing_extensions 4.8.0
tzdata 2024.1
urllib3 1.26.13
wandb 0.16.6
wcwidth 0.2.13
wheel 0.41.2
xxhash 3.4.1
yarl 1.9.4

hsb1995 commented Apr 9, 2024

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      3630     G    /usr/lib/xorg/Xorg                            4MiB |
|    0   N/A  N/A   4142333     C    .../sam/anaconda3/envs/fsdp/bin/python     3990MiB |
|    1   N/A  N/A      3630     G    /usr/lib/xorg/Xorg                          243MiB |
|    1   N/A  N/A      3758     G    /usr/bin/gnome-shell                          9MiB |
|    1   N/A  N/A   3249081     G    /usr/libexec/gnome-shell-portal-helper        4MiB |
|    1   N/A  N/A   4142334     C    .../sam/anaconda3/envs/fsdp/bin/python     3968MiB |
+---------------------------------------------------------------------------------------+
I can confirm that during the "Loading & Quantizing Model Shards" step, both GPUs are working in parallel (one python process per GPU, as shown above). But once loading finishes, the message "process 0 terminated with signal SIGKILL" appears.
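
One thing worth ruling out here, as an assumption rather than a diagnosis: on Linux, a SIGKILL with no Python exception is often the kernel's out-of-memory killer reacting to exhausted host RAM, and the kernel log (for example via `dmesg`) would normally record such a kill. The sketch below reads `/proc/meminfo` using only the standard library so free host memory can be checked before and during a run. The function names `parse_meminfo` and `available_gib` are hypothetical helpers, not part of fsdp_qlora.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: value in KiB}."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def available_gib(info: dict[str, int]) -> float:
    """Convert the MemAvailable field (reported in KiB) to GiB."""
    return info["MemAvailable"] / 1024**2


meminfo_path = Path("/proc/meminfo")
if meminfo_path.exists():  # only present on Linux
    info = parse_meminfo(meminfo_path.read_text())
    print(f"Host RAM available: {available_gib(info):.1f} GiB")
    print(f"Swap free: {info.get('SwapFree', 0) / 1024**2:.1f} GiB")
```

If available memory drops toward zero just before the crash, host RAM is the likely culprit; if it stays comfortable, the cause lies elsewhere.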

hsb1995 commented Apr 9, 2024

image
Training works with smaller model weights, but the problem appears with larger weights.
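
To put "small" versus "large" weights in numbers, here is a rough back-of-the-envelope sketch using the "all params" count from the training log above. It is only a lower bound on memory use: it ignores quantization metadata, LoRA parameters, activations, optimizer state, and any temporary copies made while loading, and it says nothing about how the project actually stages weights in host RAM. `weight_gib` is a hypothetical helper introduced for illustration.

```python
GIB = 1024**3

# "all params" as reported in the training log above.
ALL_PARAMS = 69_721_137_152


def weight_gib(n_params: int, bits_per_param: int) -> float:
    """Size of the raw weights alone, in GiB, at a given precision."""
    return n_params * bits_per_param / 8 / GIB


for label, bits in [("16-bit (bf16/fp16)", 16), ("4-bit quantized", 4)]:
    print(f"{label}: about {weight_gib(ALL_PARAMS, bits):.1f} GiB")
```

Comparing these figures with the machine's host RAM and total GPU memory gives a quick sense of whether the large model can fit at all during loading and training.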

hsb1995 commented Apr 9, 2024

image
My code runs on dual 3090 GPUs. Could the authors please take a look?
