
Add Llama 3.1 support #87

Closed · wants to merge 2 commits

Conversation


@Bihan Bihan commented Aug 29, 2024

What does this PR do?

This PR adds Llama 3.1 8B support. Please refer to the PR for discussion history.

Fixes

  • Bumped TGI version from 2.0.3 to 2.2.0
  • Bumped Transformers from 4.41.1 to 4.43.3
  • Bumped Rust from `1.77` to `1.79`
  • Added an `otlp_service_name` argument to the `serve()` method with default value `text-generation-inference.server` (see the sketch after this list)
  • Added a `max_input_tokens` argument to the `serve()` method with default value `None`
  • Modified modelling_llama.py to fix the `rope_scaling` issue
  • Added a Llama 3.1 8B test in test_decode.py
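
As a rough illustration of the serve() changes listed above, here is a minimal sketch assuming a Typer-based cli.py like the one in TGI's text_generation_server; only the two new options and their defaults come from this PR, everything else is illustrative:

    from typing import Optional

    import typer

    app = typer.Typer()

    @app.command()
    def serve(
        model_id: str,
        revision: Optional[str] = None,
        # New option: service name reported to the OTLP tracing backend.
        otlp_service_name: str = "text-generation-inference.server",
        # New option: accepted because the launcher always passes it, even if
        # the TPU server does not make use of it yet.
        max_input_tokens: Optional[int] = None,
    ):
        ...  # actual serving logic lives in the real cli.py

Accepting --max-input-tokens here is what avoids the "No such option" error discussed further down in the thread.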

Bihan Rana and others added 2 commits on August 29, 2024 at 15:41
  • Add Llama 3.1 in test_decode.py
  • Set generation_config._eos_token_tensor to None

Bihan commented Aug 29, 2024

@tengomucho

  1. Regarding our discussion REF

you are right, the right parameter to use is "MAX_INPUT_TOKENS", I mixed things up. My point is: what do you want to achieve by adding this to the cli in the text generation server? For now IIRC this value is used in the launcher to define the maximum number of input tokens that can be passed from the router to the server. The server for now does not use that. It is ok to add it to the cli, but to be effective you will also need to add it to the serve function, and do something with it, otherwise it will not have any effect.

We only require MAX_TOTAL_TOKENS and MAX_BATCH_PREFILL_TOKENS, but looking deeper into the TGI v2.2.0 launcher's main.rs, we found that the launcher always passes the --max-input-tokens argument to the serve function. This is why the cli.py serve function raises Error: No such option: --max-input-tokens rank=0.

So how can we handle this --max-input-tokens parameter inside cli.py?

  2. Regarding our discussion REF

We could do that @Bihan , but that would mean that we would end up with a code that diverges more from the original transformers' code. I was suggesting to stay as close as possible to their implementation to simplify maintenance: if there is a new update to transformers to support a new feature or fix a bug, it will be easier to support it if optimum-tpu code is more similar.

I understand your point about maintaining alignment with the original transformers’ code to simplify updates and maintenance. After reviewing the changes in modelling_llama.py from Transformers v4.43.4, I believe the best approach might be to create a new file, optimum/tpu/modelling_llama.py, specifically tailored for TPU.

In the meantime, as a workaround to address the rope_scaling issue, is there a way to load a custom rope_scaling configuration instead of relying on the default rope_scaling in LLaMA 3.1’s config.json?
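
For illustration only: with the bumped transformers 4.43 the Llama 3.1 config itself parses fine, so one commonly used workaround is to load the config, overwrite rope_scaling with the legacy two-key format that the older copied modelling code validates against, and pass the modified config when loading the model. This is a sketch, not the fix adopted in this PR; the checkpoint name is just an example, and overriding the scaling this way changes the RoPE behavior, so generation quality may differ from the official llama3 scaling:

    from transformers import AutoConfig, AutoModelForCausalLM

    model_id = "meta-llama/Meta-Llama-3.1-8B"  # example checkpoint

    config = AutoConfig.from_pretrained(model_id)
    # Llama 3.1 ships a rope_scaling dict with extra keys (rope_type="llama3",
    # low_freq_factor, high_freq_factor, ...). Older modelling code only accepts
    # the legacy {"type", "factor"} format, so override it before loading weights.
    config.rope_scaling = {"type": "dynamic", "factor": 8.0}

    model = AutoModelForCausalLM.from_pretrained(model_id, config=config)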

@@ -120,6 +120,7 @@ def create(
     )
     generation_config.max_length = max_seq_length
 
+    generation_config._eos_token_tensor = None
Collaborator

why do you need to modify this?

Author

@Bihan Bihan Aug 30, 2024


@tengomucho Sorry, I forgot to mention this. While running TGI with Llama 3.1 8B, I got the error below. I feel this too is an unreasonable hack to run Llama 3.1 rather than a proper way to handle the error. Please suggest how we can handle this.

2024-08-05T16:51:28.059351Z  INFO text_generation_launcher: Warmup (this can take several minutes)
2024-08-05T16:51:28.062291Z DEBUG text_generation_launcher: Generator@0 WARMUP
2024-08-05T16:51:28.062400Z DEBUG text_generation_launcher: Warming up the model
2024-08-05T16:51:28.063135Z DEBUG text_generation_launcher: Warmup for 1 requests, truncate value 256 seq_len 150
2024-08-05T16:51:28.063287Z DEBUG text_generation_launcher: Prefilling 1 new request(s) adding to 0 active slot(s)
2024-08-05T16:51:28.063567Z DEBUG text_generation_launcher: Request 0 assigned to slot 0
2024-08-05T16:51:28.066032Z DEBUG text_generation_launcher: Error in command WARMUP
2024-08-05T16:51:28.067444Z DEBUG text_generation_launcher: Traceback (most recent call last):
2024-08-05T16:51:28.067451Z DEBUG text_generation_launcher:   File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 842, in _mp_fn
2024-08-05T16:51:28.067455Z DEBUG text_generation_launcher:     return_to_caller(generator.warmup(batch=batch))
2024-08-05T16:51:28.067457Z DEBUG text_generation_launcher:   File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 420, in warmup
2024-08-05T16:51:28.067475Z DEBUG text_generation_launcher:     _generations, next_batch = self.prefill(warmup_batch)
2024-08-05T16:51:28.067480Z DEBUG text_generation_launcher:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-08-05T16:51:28.067484Z DEBUG text_generation_launcher:     return func(*args, **kwargs)
2024-08-05T16:51:28.067493Z DEBUG text_generation_launcher:   File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 513, in prefill
2024-08-05T16:51:28.067496Z DEBUG text_generation_launcher:     selector = TokenSelector.create(
2024-08-05T16:51:28.067498Z DEBUG text_generation_launcher:   File "/opt/optimum-tpu/optimum/tpu/generation/token_selector.py", line 124, in create
2024-08-05T16:51:28.067501Z DEBUG text_generation_launcher:     logits_processor = model._get_logits_processor(
2024-08-05T16:51:28.067504Z DEBUG text_generation_launcher:   File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 871, in _get_logits_processor
2024-08-05T16:51:28.067508Z DEBUG text_generation_launcher:     and generation_config._eos_token_tensor is not None
2024-08-05T16:51:28.067512Z DEBUG text_generation_launcher: AttributeError: 'GenerationConfig' object has no attribute '_eos_token_tensor'
2024-08-05T16:51:28.067521Z DEBUG text_generation_launcher: 
2024-08-05T16:51:28.067780Z ERROR text_generation_launcher: Method Warmup encountered an error.

Collaborator

@tengomucho tengomucho Aug 30, 2024


Ok, I see. This is a bug in transformers; there is already an open issue about it: huggingface/transformers#32207
I will try to open a PR to fix it in transformers. In the meantime, a workaround could be applied in the generator.py file. We already modify the generation_config to suit TGI, so we could add one more change in the Slot.assign method, something like this:

    # Workaround to avoid bug in token_utils in transformers.
    self._generation_config._eos_token_tensor = getattr(self._generation_config, "_eos_token_tensor", None)
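
For context, the same idea as a small self-contained sketch (the helper name is hypothetical; in the PR the assignment would live in generator.py's Slot.assign):

    import copy

    from transformers import GenerationConfig

    def prepare_generation_config(generation_config: GenerationConfig) -> GenerationConfig:
        # Work on a copy so per-request tweaks do not leak into the shared config.
        config = copy.deepcopy(generation_config)
        # Workaround for huggingface/transformers#32207: some GenerationConfig
        # instances never get the private _eos_token_tensor attribute, so give it
        # a None default before _get_logits_processor tries to read it.
        config._eos_token_tensor = getattr(config, "_eos_token_tensor", None)
        return config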

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


Bihan commented Sep 10, 2024

This PR is being closed because it does not contain the recent changes from the main branch. A new PR has been created as a replacement.

@Bihan Bihan closed this Sep 10, 2024
@tengomucho
Collaborator

FYI @Bihan, next time you can just rebase onto the main branch and force-push:

    git checkout main
    git pull
    git checkout mybranch
    git rebase main
    # resolve conflicts
    git push --force

This way you do not need to open a new PR 🤗
