
Connection error #23

Open
yone456 opened this issue Nov 2, 2023 · 13 comments


yone456 commented Nov 2, 2023

Hello! I tried an experiment using the llama2 13b model and got a CONNECTION ERROR.

RL script

python -m lamorel_launcher.launch --config-path /home/xxx/Grounding_LLMs_with_online_RL/lamorel/examples/PPO_LoRA_finetuning --config-name local_gpu_config rl_script_args.path=/home/xxx/Grounding_LLMs_with_online_RL/lamorel/examples/PPO_LoRA_finetuning/main.py lamorel_args.accelerate_args.machine_rank=0 lamorel_args.llm_args.model_path=/home/xxx/llama/llama-2-13b

LLM server

python -m lamorel_launcher.launch --config-path /home/xxx/Grounding_LLMs_with_online_RL/lamorel/examples/PPO_LoRA_finetuning --config-name local_gpu_config rl_script_args.path=/home/xxx/Grounding_LLMs_with_online_RL/lamorel/examples/PPO_LoRA_finetuning/main.py lamorel_args.accelerate_args.machine_rank=1 lamorel_args.llm_args.model_path=/home/xxx/llama/llama-2-13b

The following error occurred when starting the LLM server with the command above.

ConnectionError: Tried to launch distributed communication on port 30004, but another process is utilizing it. Please specify a different port (such as using the --main_process_port flag or specifying a different main_process_port in your config file) and rerun your script. To automatically use the next open port (on a single node), you can set this to 0.

Could you please advise me on how to resolve the error?
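For context, the flag mentioned in the error maps to Accelerate's main_process_port setting. A hedged sketch of overriding it through lamorel's Hydra config, assuming local_gpu_config exposes main_process_port under lamorel_args.accelerate_args (check your config file for the exact key):

python -m lamorel_launcher.launch --config-path /home/xxx/Grounding_LLMs_with_online_RL/lamorel/examples/PPO_LoRA_finetuning --config-name local_gpu_config lamorel_args.accelerate_args.machine_rank=1 lamorel_args.accelerate_args.main_process_port=0 lamorel_args.llm_args.model_path=/home/xxx/llama/llama-2-13b

Setting the port to 0 lets Accelerate pick the next open port on a single node; note that, as the maintainer explains further down, the conflict here actually comes from the order in which the two processes are launched rather than from the port value itself.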

@ClementRomac (Collaborator)

Hi,

I encounter the same error when I launch the RL script first. It appears there is a conflict of master processes when manually launching two processes on the same machine.

I will investigate this. In the meantime there are 2 solutions for you:

  1. Let torch launch the two processes: set num_machines: 1 in your config and only launch your RL script as you did in the example above (the lamorel_launcher will ask torch to launch the two processes); see the sketch after this list.
  2. Keep on launching the two processes manually but launch the LLM server first.
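A minimal command-line sketch of option 1, assuming the Hydra config exposes num_machines under lamorel_args.accelerate_args (the exact key may differ in your config file):

python -m lamorel_launcher.launch --config-path /home/xxx/Grounding_LLMs_with_online_RL/lamorel/examples/PPO_LoRA_finetuning --config-name local_gpu_config rl_script_args.path=/home/xxx/Grounding_LLMs_with_online_RL/lamorel/examples/PPO_LoRA_finetuning/main.py lamorel_args.accelerate_args.num_machines=1 lamorel_args.llm_args.model_path=/home/xxx/llama/llama-2-13b

With num_machines set to 1, only this single command is needed; the launcher lets torch spawn both the RL and LLM processes locally.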

ClementRomac added the bug (Something isn't working) label Nov 2, 2023
ClementRomac self-assigned this Nov 2, 2023

yone456 commented Nov 6, 2023

Thanks for the advice; I was able to run it.
I can run Flan-T5 and other models, but not Llama or Llama 2.
Do you plan to support the Llama and Llama 2 models in the future?


ClementRomac commented Nov 8, 2023

What exactly is going wrong with Llama?

For your information, I am currently working on adding a couple of things (along with several fixes):

  • Quantization (the ability to load models in 4-bit)
  • Better caching, especially for decoder-only models (which also simplifies how to add custom module functions on top of them)
  • Better support of decoder-only models

With these improvements, I am able to run and train (with QLoRA) models like Llama2, OPT or Mistral.
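For orientation, here is a generic transformers + peft sketch of 4-bit loading plus LoRA, the combination referred to as QLoRA above; this is not lamorel's actual integration, and the hyperparameters are illustrative:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit and dispatch it onto the available GPUs.
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Wrap it with LoRA adapters so only a small fraction of parameters is trained.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # prints a "trainable params: ..." line like the log further down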

These should arrive shortly here (in the coming weeks).


yone456 commented Nov 9, 2023

That is great news for me.
I am very much looking forward to those features.

By the way, I get the following error when I start llama2.

Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 6.60it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:21<00:00, 7.31s/it]
Using pad_token, but it is not set yet.
trainable params: 19401729 || all params: 13035266049 || trainable%: 0.14884029928555545
Error executing job with overrides: ['lamorel_args.accelerate_args.machine_rank=1']
Traceback (most recent call last):
File "/home/xxx/Grounding_LLMs_with_online_RL/lamorel/examples/PPO_LoRA_finetuning/main.py", line 393, in
main()
File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in
lambda: hydra.run(
File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "/home/xxx/Grounding_LLMs_with_online_RL/lamorel/examples/PPO_LoRA_finetuning/main.py", line 255, in main
lm_server = Caller(config_args.lamorel_args,
File "/home/xxx/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/caller.py", line 53, in init
Server(
File "/home/xxx/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/server.py", line 58, in init
DDP(self._model, process_group=self._llm_group,
File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 565, in init
self._log_and_throw(
File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 686, in _log_and_throw
raise err_type(err_msg)
ValueError: DistributedDataParallel's input module must be on the same type of devices, but input module parameters locate in {'cuda', 'meta'}.

@ClementRomac (Collaborator)

Which model are you using exactly? Also, what are your versions of transformers and accelerate?


yone456 commented Nov 12, 2023

I am using the following versions:

accelerate 0.21.0
transformers 4.33.0

The model used is Llama-2-13b-hf.
https://huggingface.co/meta-llama/Llama-2-13b-hf

@ClementRomac (Collaborator)

Ok, so first, to give more details about your initial issue with the Connection Error: it is Accelerate that checks whether the requested port is already in use. When a process with rank > 0 is launched first, the port is not in use yet (as that process comes first) and torch distributed does not launch anything on it, since only the process with rank=0 starts the master process. So when you then launch the process with rank=0, the port is still free and everything runs smoothly. However, when you do the opposite, the process with rank=0 (launched first) starts the master process listening on the requested port, but Accelerate still checks, for the second process with rank > 0, that the port is free.

I guess this check should take the rank of the current process into account. I haven't opened an issue yet, as manually launching two "machines" on the same physical machine isn't really a "normal" use case for Accelerate. So I would advise setting num_machines: 1.

Concerning Llama, this is surprising: it seems the piece of code putting the LLM's weights on a CUDA device is not working as expected, and your LLM is still on the fake 'meta' device when passed to DDP. Could you try upgrading Accelerate?
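For instance, in a pip-managed environment:

pip install --upgrade accelerate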

@ClementRomac (Collaborator)

It may also be related to your PyTorch version. See #24.


yone456 commented Nov 19, 2023

Thanks to your advice, the error was avoided. Thank you very much.

Sorry, I have two questions.
When using a decoder-only (causal) model like Llama 2, I get the following error in main.py of PPO_LoRA_finetuning. Is there any workaround for this error?

File "/home/xxx/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/server.py", line 65, in init
self.run()
File "/home/xxx/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/server.py", line 131, in run
current_process_results = self._process_calls(calls_to_process)
File "/home/xxx/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/server.py", line 109, in _process_calls
llm_results.append(self._model(_call))
File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/xxx/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/llms/hf_llm.py", line 285, in forward
results = _fn(_outputs,
File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/xxx/Grounding_LLMs_with_online_RL/lamorel/examples/PPO_LoRA_finetuning/main.py", line 39, in forward
raise NotImplementedError()
NotImplementedError

Also, how can I do fine-tuning with multiple GPUs in PPO_LoRA_finetuning?

@ClementRomac (Collaborator)

Hi,

Decoder-Only support is part of the multiple changes I have to push. This update will be added in a PR tomorrow morning. Examples will also be slightly modified, so you may have to adapt your code.

Concerning multi-GPU, if you have set lamorel_args.llm_args.parallelism.use_gpu=true, you can choose how many GPUs each LLM uses with lamorel_args.llm_args.parallelism.model_parallelism_size. For example, with lamorel_args.distributed_setup_args.n_llm_processes=1 and lamorel_args.llm_args.parallelism.model_parallelism_size=2, lamorel will deploy one LLM and expect at least 2 GPUs on your system to assign to it. If you set lamorel_args.distributed_setup_args.n_llm_processes=2, lamorel will deploy 2 LLMs and expect at least 4 GPUs (the first 2 assigned to the first LLM, the other 2 to the second LLM).
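As a concrete sketch of the settings named above, passed as Hydra overrides on the launch command (values are illustrative and the paths mirror the earlier examples):

python -m lamorel_launcher.launch --config-path /home/xxx/Grounding_LLMs_with_online_RL/lamorel/examples/PPO_LoRA_finetuning --config-name local_gpu_config rl_script_args.path=/home/xxx/Grounding_LLMs_with_online_RL/lamorel/examples/PPO_LoRA_finetuning/main.py lamorel_args.llm_args.parallelism.use_gpu=true lamorel_args.llm_args.parallelism.model_parallelism_size=2 lamorel_args.distributed_setup_args.n_llm_processes=1 lamorel_args.llm_args.model_path=/home/xxx/llama/llama-2-13b

This deploys a single LLM split across 2 GPUs; raising n_llm_processes to 2 deploys two such LLMs and therefore needs at least 4 GPUs.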

@ClementRomac (Collaborator)

Hi,

The Decoder-Only support has come at last!
Here's the PR: #26

It has been merged into the main branch. All examples have been modified.
Let me know if you face any issue :)


yone456 commented Nov 22, 2023

Thanks for the great update!
I immediately tried it with Llama 2 and got the following error, but I was able to avoid it by setting device_map="auto".

ValueError: DistributedDataParallel's input module must be on the same type of devices, but input module parameters locate in {'cuda', 'meta'}.
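For reference, a minimal sketch of what that workaround corresponds to at the transformers level; where exactly the device_map is plugged into lamorel's model loading is an assumption here:

from transformers import AutoModelForCausalLM

# device_map="auto" lets the checkpoint shards be dispatched onto the available
# GPUs, so no parameters remain on the 'meta' device when DDP wraps the model.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    device_map="auto",
)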

Also, when I ran PPO_LoRA_finetuning on Llama 2, I got the following warning. Is there any solution?

[2023-11-22 15:06:46,565][root][WARNING] - PPO ratio != 1 !!

@ClementRomac (Collaborator)

I can't manage to reproduce your error when loading Llama 2.
If you figure out what is happening, let me know.

Concerning the warning, models that use Rotary PE (e.g. Llama 2, Mistral) are affected by padding: huggingface/transformers#25921

As we are batching multiple transitions in the PPOUpdater (and using padding to do so), the log-probs differ from the ones obtained when collecting the transitions. I unfortunately have no solution for now. I am currently trying to see whether I can make Mistral or Llama 2 converge despite this issue.
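To make the warning above concrete, here is a self-contained sketch of the kind of check that triggers it; the variable names are illustrative, not lamorel's actual code:

import torch

# Log-probs stored when the transitions were collected (no padding)...
old_logprobs = torch.tensor([-1.20, -0.75, -2.10])
# ...and the same actions' log-probs recomputed in the padded PPO update batch.
new_logprobs = torch.tensor([-1.18, -0.75, -2.14])

# On the first PPO epoch the policy has not changed yet, so the importance
# ratio should be exactly 1; with rotary PE, padding shifts the recomputed
# log-probs and the ratio drifts away from 1.
ratio = torch.exp(new_logprobs - old_logprobs)
if not torch.allclose(ratio, torch.ones_like(ratio)):
    print("PPO ratio != 1 !!")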
