Error when running run_gemini.sh under ColossalAI/examples/language/gpt/gemini/ #3649
Replies: 1 comment
-
You need a newer Colossal-AI: the gemini example asserts that the installed colossalai version is >= 0.2.0.
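For reference, below is a minimal sketch of the version gate that raises the AssertionError in the log, assuming the script reads CAI_VERSION from colossalai.__version__ (the exact wiring in train_gpt_demo.py may differ). Upgrading Colossal-AI, e.g. with pip install -U colossalai or an install from source, makes the check pass.

# Minimal sketch of the failing check; not the exact code from train_gpt_demo.py.
# Assumption: CAI_VERSION is taken from colossalai.__version__.
from packaging import version

import colossalai

CAI_VERSION = colossalai.__version__
if version.parse(CAI_VERSION) < version.parse("0.2.0"):
    raise AssertionError(
        f"Found Colossal-AI {CAI_VERSION}; the gemini example needs >= 0.2.0. "
        "Upgrade with pip install -U colossalai or install from source."
    )
print(f"Colossal-AI {CAI_VERSION} is new enough for the gemini example.")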
-
An error occurred when I tried to run the script run_gemini.sh under ColossalAI/examples/language/gpt/gemini.
Log:
please install Colossal-AI from https://www.colossalai.org/download or from source
Traceback (most recent call last):
File "/workspace/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py", line 352, in
main()
File "/workspace/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py", line 185, in main
assert version.parse(CAI_VERSION) >= version.parse("0.2.0")
AssertionError
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 61) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
./train_gpt_demo.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2023-04-26_09:31:39
host : c95a3316a474
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 61)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
CMD:
bash run_gemini.sh
Any idea what is causing this?