
Connection error #23

Open
yone456 opened this issue Nov 2, 2023 · 13 comments


yone456 commented Nov 2, 2023

Hello! I tried an experiment using the llama2 13b model and got a CONNECTION ERROR.

RL script

python -m lamorel_launcher.launch --config-path /home/xxx/Grounding_LLMs_with_online_RL/lamorel/examples/PPO_LoRA_finetuning --config-name local_gpu_config rl_script_args.path=/home/xxx/Grounding_LLMs_with_online_RL/lamorel/examples/PPO_LoRA_finetuning/main.py lamorel_args.accelerate_args.machine_rank=0 lamorel_args.llm_args.model_path=/home/xxx/llama/llama-2-13b

LLM server

python -m lamorel_launcher.launch --config-path /home/xxx/Grounding_LLMs_with_online_RL/lamorel/examples/PPO_LoRA_finetuning --config-name local_gpu_config rl_script_args.path=/home/xxx/Grounding_LLMs_with_online_RL/lamorel/examples/PPO_LoRA_finetuning/main.py lamorel_args.accelerate_args.machine_rank=1 lamorel_args.llm_args.model_path=/home/xxx/llama/llama-2-13b

The following error occurred when starting the LLM server with the command above.

ConnectionError: Tried to launch distributed communication on port 30004, but another process is utilizing it. Please specify a different port (such as using the --main_process_port flag or specifying a different main_process_port in your config file) and rerun your script. To automatically use the next open port (on a single node), you can set this to 0.

Could you please advise me on how to resolve the error?
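For context, the flag mentioned in the error maps to Accelerate's main_process_port setting. A hedged sketch of overriding it through lamorel's Hydra config, assuming local_gpu_config exposes main_process_port under lamorel_args.accelerate_args (check your config file for the exact key):

python -m lamorel_launcher.launch --config-path /home/xxx/Grounding_LLMs_with_online_RL/lamorel/examples/PPO_LoRA_finetuning --config-name local_gpu_config lamorel_args.accelerate_args.machine_rank=1 lamorel_args.accelerate_args.main_process_port=0 lamorel_args.llm_args.model_path=/home/xxx/llama/llama-2-13b

Setting the port to 0 lets Accelerate pick the next open port on a single node; note that, as the maintainer explains further down, the conflict here actually comes from the order in which the two processes are launched rather than from the port value itself.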

@ClementRomac (Collaborator)

Hi,

I encounter the same error when I launch the RL script first. It appears there is a conflict of master processes when manually launching two processes on the same machine.

I will investigate this. In the meantime there are 2 solutions for you:

  1. Let torch launch the two processes: set num_machines: 1 in your config and only launch your RL script as you did in the example above (the lamorel_launcher will ask torch to launch the two processes); see the sketch after this list.
  2. Keep on launching the two processes manually but launch the LLM server first.
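A minimal command-line sketch of option 1, assuming the Hydra config exposes num_machines under lamorel_args.accelerate_args (the exact key may differ in your config file):

python -m lamorel_launcher.launch --config-path /home/xxx/Grounding_LLMs_with_online_RL/lamorel/examples/PPO_LoRA_finetuning --config-name local_gpu_config rl_script_args.path=/home/xxx/Grounding_LLMs_with_online_RL/lamorel/examples/PPO_LoRA_finetuning/main.py lamorel_args.accelerate_args.num_machines=1 lamorel_args.llm_args.model_path=/home/xxx/llama/llama-2-13b

With num_machines set to 1, only this single command is needed; the launcher lets torch spawn both the RL and LLM processes locally.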

ClementRomac added the bug (Something isn't working) label Nov 2, 2023
ClementRomac self-assigned this Nov 2, 2023

yone456 commented Nov 6, 2023

Thanks for the advice; I was able to run it.
I can run Flan-T5 and other models, but not Llama or Llama 2.
Do you plan to support the Llama and Llama 2 models in the future?


ClementRomac commented Nov 8, 2023

What exactly is going wrong with Llama?

For your information, I am currently working on adding a couple of things (along with several fixes):

  • Quantization (the ability to load models in 4-bit)
  • Better caching, especially for decoder-only models (which also simplifies how to add custom module functions on top of them)
  • Better support of decoder-only models

With these improvements, I am able to run and train (with QLoRA) models like Llama2, OPT or Mistral.
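For orientation, here is a generic transformers + peft sketch of 4-bit loading plus LoRA, the combination referred to as QLoRA above; this is not lamorel's actual integration, and the hyperparameters are illustrative:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit and dispatch it onto the available GPUs.
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Wrap it with LoRA adapters so only a small fraction of parameters is trained.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # prints a "trainable params: ..." line like the log further down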

These should arrive shortly here (in the coming weeks).


yone456 commented Nov 9, 2023

That is great news for me.
I am very much looking forward to those features.

By the way, I get the following error when I start llama2.

Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 6.60it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:21<00:00, 7.31s/it]
Using pad_token, but it is not set yet.
trainable params: 19401729 || all params: 13035266049 || trainable%: 0.14884029928555545
Error executing job with overrides: ['lamorel_args.accelerate_args.machine_rank=1']
Traceback (most recent call last):
File "/home/xxx/Grounding_LLMs_with_online_RL/lamorel/examples/PPO_LoRA_finetuning/main.py", line 393, in
main()
File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in
lambda: hydra.run(
File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "/home/xxx/Grounding_LLMs_with_online_RL/lamorel/examples/PPO_LoRA_finetuning/main.py", line 255, in main
lm_server = Caller(config_args.lamorel_args,
File "/home/xxx/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/caller.py", line 53, in init
Server(
File "/home/xxx/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/server.py", line 58, in init
DDP(self._model, process_group=self._llm_group,
File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 565, in init
self._log_and_throw(
File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 686, in _log_and_throw
raise err_type(err_msg)
ValueError: DistributedDataParallel's input module must be on the same type of devices, but input module parameters locate in {'cuda', 'meta'}.

@ClementRomac (Collaborator)

Which model are you using exactly? Also, what are your versions of transformers and accelerate?


yone456 commented Nov 12, 2023

I am using the following versions:

accelerate 0.21.0
transformers 4.33.0

The model used is Llama-2-13b-hf.
https://huggingface.co/meta-llama/Llama-2-13b-hf

@ClementRomac (Collaborator)

Ok, so first, to give more details about your initial issue with the Connection Error: it is Accelerate that checks whether the requested port is already in use. When a process with rank > 0 is launched first, the port is not in use yet (as that process comes first) and torch distributed does not launch anything on it, since only the process with rank=0 starts the master process. So when you then launch the process with rank=0, the port is still free and everything runs smoothly. However, when you do the opposite, the process with rank=0 (launched first) starts the master process listening on the requested port, but Accelerate still checks, for the second process with rank > 0, that the port is free.

I guess this check should take the rank of the current process into account. I haven't opened an issue yet, as manually launching two "machines" on the same physical machine isn't really a "normal" use case for Accelerate. So I would advise setting num_machines: 1.

Concerning Llama, this is surprising: it seems the piece of code putting the LLM's weights on a CUDA device is not working as expected, and your LLM is still on the fake 'meta' device when passed to DDP. Could you try upgrading Accelerate?
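For instance, in a pip-managed environment:

pip install --upgrade accelerate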

@ClementRomac (Collaborator)

It may also be related to your PyTorch version. See #24.


yone456 commented Nov 19, 2023

Thanks to your advice, the error was avoided. Thank you very much.

Sorry, I have two questions.
When using a decoder-only (causal) model like Llama 2, I get the following error in main.py of PPO_LoRA_finetuning. Is there any workaround for this error?

File "/home/xxx/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/server.py", line 65, in init
self.run()
File "/home/xxx/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/server.py", line 131, in run
current_process_results = self._process_calls(calls_to_process)
File "/home/xxx/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/server.py", line 109, in _process_calls
llm_results.append(self._model(_call))
File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/xxx/Grounding_LLMs_with_online_RL/lamorel/lamorel/src/lamorel/server/llms/hf_llm.py", line 285, in forward
results = _fn(_outputs,
File "/home/xxx/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/xxx/Grounding_LLMs_with_online_RL/lamorel/examples/PPO_LoRA_finetuning/main.py", line 39, in forward
raise NotImplementedError()
NotImplementedError

Also, how can I do fine-tuning with multiple GPUs in PPO_LoRA_finetuning?

@ClementRomac (Collaborator)

Hi,

Decoder-Only support is part of the multiple changes I have to push. This update will be added in a PR tomorrow morning. Examples will also be slightly modified, so you may have to adapt your code.

Concerning multi-GPU, if you have set lamorel_args.llm_args.parallelism.use_gpu=true, you can choose how many GPUs each LLM uses with lamorel_args.llm_args.parallelism.model_parallelism_size. For example, with lamorel_args.distributed_setup_args.n_llm_processes=1 and lamorel_args.llm_args.parallelism.model_parallelism_size=2, lamorel will deploy one LLM and expect at least 2 GPUs on your system to assign to it. If you set lamorel_args.distributed_setup_args.n_llm_processes=2, lamorel will deploy 2 LLMs and expect at least 4 GPUs (the first 2 assigned to the first LLM, the other 2 to the second LLM).
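As a concrete sketch of the settings named above, passed as Hydra overrides on the launch command (values are illustrative and the paths mirror the earlier examples):

python -m lamorel_launcher.launch --config-path /home/xxx/Grounding_LLMs_with_online_RL/lamorel/examples/PPO_LoRA_finetuning --config-name local_gpu_config rl_script_args.path=/home/xxx/Grounding_LLMs_with_online_RL/lamorel/examples/PPO_LoRA_finetuning/main.py lamorel_args.llm_args.parallelism.use_gpu=true lamorel_args.llm_args.parallelism.model_parallelism_size=2 lamorel_args.distributed_setup_args.n_llm_processes=1 lamorel_args.llm_args.model_path=/home/xxx/llama/llama-2-13b

This deploys a single LLM split across 2 GPUs; raising n_llm_processes to 2 deploys two such LLMs and therefore needs at least 4 GPUs.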

@ClementRomac (Collaborator)

Hi,

The Decoder-Only support has come at last!
Here's the PR: #26

It has been merged into the main branch. All examples have been modified.
Let me know if you face any issue :)


yone456 commented Nov 22, 2023

Thanks for the great update!
I immediately tried it with Llama 2 and got the following error, but I was able to avoid it by setting device_map="auto".

ValueError: DistributedDataParallel's input module must be on the same type of devices, but input module parameters locate in {'cuda', 'meta'}.
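For reference, a minimal sketch of what that workaround corresponds to at the transformers level; where exactly the device_map is plugged into lamorel's model loading is an assumption here:

from transformers import AutoModelForCausalLM

# device_map="auto" lets the checkpoint shards be dispatched onto the available
# GPUs, so no parameters remain on the 'meta' device when DDP wraps the model.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    device_map="auto",
)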

Also, when I ran PPO_LoRA_finetuning on Llama 2, I got the following warning. Is there any solution?

[2023-11-22 15:06:46,565][root][WARNING] - PPO ratio != 1 !!

@ClementRomac (Collaborator)

I can't manage to reproduce your error when loading Llama 2.
If you figure out what is happening, let me know.

Concerning the warning, models that use Rotary PE (e.g. Llama 2, Mistral) are affected by padding: huggingface/transformers#25921

As we are batching multiple transitions in the PPOUpdater (and using padding to do so), the log-probs differ from the ones obtained when collecting the transitions. I unfortunately have no solution for now. I am currently trying to see whether I can make Mistral or Llama 2 converge despite this issue.
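To make the warning above concrete, here is a self-contained sketch of the kind of check that triggers it; the variable names are illustrative, not lamorel's actual code:

import torch

# Log-probs stored when the transitions were collected (no padding)...
old_logprobs = torch.tensor([-1.20, -0.75, -2.10])
# ...and the same actions' log-probs recomputed in the padded PPO update batch.
new_logprobs = torch.tensor([-1.18, -0.75, -2.14])

# On the first PPO epoch the policy has not changed yet, so the importance
# ratio should be exactly 1; with rotary PE, padding shifts the recomputed
# log-probs and the ratio drifts away from 1.
ratio = torch.exp(new_logprobs - old_logprobs)
if not torch.allclose(ratio, torch.ones_like(ratio)):
    print("PPO ratio != 1 !!")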
