
Add Llama 3.1 support #87

Closed · wants to merge 2 commits

Conversation


@Bihan Bihan commented Aug 29, 2024

What does this PR do?

This PR adds Llama 3.1 8B support. Please refer to the PR for discussion history.

Fixes

  • Bumped TGI version from 2.0.3 to 2.2.0
  • Bumped Transformers from 4.41.1 to 4.43.3
  • Bumped Rust from `1.77` to `1.79`
  • Added an `otlp_service_name` argument to the `serve()` method with default value `text-generation-inference.server` (see the sketch after this list)
  • Added a `max_input_tokens` argument to the `serve()` method with default value `None`
  • Modified modelling_llama.py to fix the `rope_scaling` issue
  • Added a Llama 3.1 8B test in test_decode.py
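
As a rough illustration of the serve() changes listed above, here is a minimal sketch assuming a Typer-based cli.py like the one in TGI's text_generation_server; only the two new options and their defaults come from this PR, everything else is illustrative:

    from typing import Optional

    import typer

    app = typer.Typer()

    @app.command()
    def serve(
        model_id: str,
        revision: Optional[str] = None,
        # New option: service name reported to the OTLP tracing backend.
        otlp_service_name: str = "text-generation-inference.server",
        # New option: accepted because the launcher always passes it, even if
        # the TPU server does not make use of it yet.
        max_input_tokens: Optional[int] = None,
    ):
        ...  # actual serving logic lives in the real cli.py

Accepting --max-input-tokens here is what avoids the "No such option" error discussed further down in the thread.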

Bihan Rana and others added 2 commits on August 29, 2024 at 15:41
  • Add Llama 3.1 in test_decode.py
  • Set generation_config._eos_token_tensor to None

Bihan commented Aug 29, 2024

@tengomucho

  1. Regarding our discussion REF

you are right, the right parameter to use is "MAX_INPUT_TOKENS", I mixed things up. My point is: what do you want to achieve by adding this to the cli in the text generation server? For now IIRC this value is used in the launcher to define the maximum number of input tokens that can be passed from the router to the server. The server for now does not use that. It is ok to add it to the cli, but to be effective you will also need to add it to the serve function, and do something with it, otherwise it will not have any effect.

We only require MAX_TOTAL_TOKENS and MAX_BATCH_PREFILL_TOKENS, but looking deeper into the TGI v2.2.0 launcher's main.rs, we found that the launcher always passes the --max-input-tokens argument to the serve function. This is why the cli.py serve function raises Error: No such option: --max-input-tokens rank=0.

So how can we handle this --max-input-tokens parameter inside cli.py?

  2. Regarding our discussion REF

We could do that @Bihan , but that would mean that we would end up with a code that diverges more from the original transformers' code. I was suggesting to stay as close as possible to their implementation to simplify maintenance: if there is a new update to transformers to support a new feature or fix a bug, it will be easier to support it if optimum-tpu code is more similar.

I understand your point about maintaining alignment with the original transformers’ code to simplify updates and maintenance. After reviewing the changes in modelling_llama.py from Transformers v4.43.4, I believe the best approach might be to create a new file, optimum/tpu/modelling_llama.py, specifically tailored for TPU.

In the meantime, as a workaround to address the rope_scaling issue, is there a way to load a custom rope_scaling configuration instead of relying on the default rope_scaling in LLaMA 3.1’s config.json?
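
For illustration only: with the bumped transformers 4.43 the Llama 3.1 config itself parses fine, so one commonly used workaround is to load the config, overwrite rope_scaling with the legacy two-key format that the older copied modelling code validates against, and pass the modified config when loading the model. This is a sketch, not the fix adopted in this PR; the checkpoint name is just an example, and overriding the scaling this way changes the RoPE behavior, so generation quality may differ from the official llama3 scaling:

    from transformers import AutoConfig, AutoModelForCausalLM

    model_id = "meta-llama/Meta-Llama-3.1-8B"  # example checkpoint

    config = AutoConfig.from_pretrained(model_id)
    # Llama 3.1 ships a rope_scaling dict with extra keys (rope_type="llama3",
    # low_freq_factor, high_freq_factor, ...). Older modelling code only accepts
    # the legacy {"type", "factor"} format, so override it before loading weights.
    config.rope_scaling = {"type": "dynamic", "factor": 8.0}

    model = AutoModelForCausalLM.from_pretrained(model_id, config=config)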

@@ -120,6 +120,7 @@ def create(
     )
     generation_config.max_length = max_seq_length
 
+    generation_config._eos_token_tensor = None
Collaborator

why do you need to modify this?

Author

@Bihan Bihan Aug 30, 2024


@tengomucho Sorry, I forgot to mention this. While running TGI with Llama 3.1 8B, I got the error below. I feel this too is an unreasonable hack to run Llama 3.1 rather than a proper way to handle the error. Please suggest how we can handle this.

2024-08-05T16:51:28.059351Z  INFO text_generation_launcher: Warmup (this can take several minutes)
2024-08-05T16:51:28.062291Z DEBUG text_generation_launcher: Generator@0 WARMUP
2024-08-05T16:51:28.062400Z DEBUG text_generation_launcher: Warming up the model
2024-08-05T16:51:28.063135Z DEBUG text_generation_launcher: Warmup for 1 requests, truncate value 256 seq_len 150
2024-08-05T16:51:28.063287Z DEBUG text_generation_launcher: Prefilling 1 new request(s) adding to 0 active slot(s)
2024-08-05T16:51:28.063567Z DEBUG text_generation_launcher: Request 0 assigned to slot 0
2024-08-05T16:51:28.066032Z DEBUG text_generation_launcher: Error in command WARMUP
2024-08-05T16:51:28.067444Z DEBUG text_generation_launcher: Traceback (most recent call last):
2024-08-05T16:51:28.067451Z DEBUG text_generation_launcher:   File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 842, in _mp_fn
2024-08-05T16:51:28.067455Z DEBUG text_generation_launcher:     return_to_caller(generator.warmup(batch=batch))
2024-08-05T16:51:28.067457Z DEBUG text_generation_launcher:   File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 420, in warmup
2024-08-05T16:51:28.067475Z DEBUG text_generation_launcher:     _generations, next_batch = self.prefill(warmup_batch)
2024-08-05T16:51:28.067480Z DEBUG text_generation_launcher:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-08-05T16:51:28.067484Z DEBUG text_generation_launcher:     return func(*args, **kwargs)
2024-08-05T16:51:28.067493Z DEBUG text_generation_launcher:   File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 513, in prefill
2024-08-05T16:51:28.067496Z DEBUG text_generation_launcher:     selector = TokenSelector.create(
2024-08-05T16:51:28.067498Z DEBUG text_generation_launcher:   File "/opt/optimum-tpu/optimum/tpu/generation/token_selector.py", line 124, in create
2024-08-05T16:51:28.067501Z DEBUG text_generation_launcher:     logits_processor = model._get_logits_processor(
2024-08-05T16:51:28.067504Z DEBUG text_generation_launcher:   File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 871, in _get_logits_processor
2024-08-05T16:51:28.067508Z DEBUG text_generation_launcher:     and generation_config._eos_token_tensor is not None
2024-08-05T16:51:28.067512Z DEBUG text_generation_launcher: AttributeError: 'GenerationConfig' object has no attribute '_eos_token_tensor'
2024-08-05T16:51:28.067521Z DEBUG text_generation_launcher: 
2024-08-05T16:51:28.067780Z ERROR text_generation_launcher: Method Warmup encountered an error.

Collaborator

@tengomucho tengomucho Aug 30, 2024


Ok, I see. This is a bug in transformers; there is already an open issue about it: huggingface/transformers#32207
I will try to open a PR to fix it in transformers. In the meantime, a workaround could be applied in the generator.py file. We already modify the generation_config to suit TGI, so we could add one more change in the Slot.assign method, something like this:

    # Workaround to avoid bug in token_utils in transformers.
    self._generation_config._eos_token_tensor = getattr(self._generation_config, "_eos_token_tensor", None)
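
For context, the same idea as a small self-contained sketch (the helper name is hypothetical; in the PR the assignment would live in generator.py's Slot.assign):

    import copy

    from transformers import GenerationConfig

    def prepare_generation_config(generation_config: GenerationConfig) -> GenerationConfig:
        # Work on a copy so per-request tweaks do not leak into the shared config.
        config = copy.deepcopy(generation_config)
        # Workaround for huggingface/transformers#32207: some GenerationConfig
        # instances never get the private _eos_token_tensor attribute, so give it
        # a None default before _get_logits_processor tries to read it.
        config._eos_token_tensor = getattr(config, "_eos_token_tensor", None)
        return config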

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


Bihan commented Sep 10, 2024

This PR is being closed because it does not contain the recent changes from the main branch. A new PR has been created as a replacement.

@Bihan Bihan closed this Sep 10, 2024
@tengomucho
Collaborator

FYI @Bihan, next time you can just rebase onto the main branch and force-push:

    git checkout main
    git pull
    git checkout mybranch
    git rebase main
    # resolve conflicts
    git push --force

This way you do not need to open a new PR 🤗
