
Remove unnecessary Tokenizer library calls #155

Merged: 7 commits merged into main on Nov 5, 2024

Conversation

@dyastremsky (Contributor) commented Oct 25, 2024

This pull request accomplishes two things by removing unnecessary Tokenizer calls:

  • Removes the default value for the tokenizer in InputsConfig so that GenAI-Perf can be used in an air-gapped environment. Today it cannot, because the default tokenizer value in InputsConfig causes GenAI-Perf to try to fetch the default tokenizer from HuggingFace, even when a local path is provided for the tokenizer.
  • Reduces the time for a --help call or a parser error return by 1.5-4x by delaying the Tokenizer import until it is needed (see the deferred-import sketch below).
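For illustration, a minimal sketch of the deferred-import pattern, under the assumption that argument parsing itself needs no tokenizer; the module and option names are stand-ins, not GenAI-Perf's actual layout:

```python
# Sketch only: names are illustrative, not the real GenAI-Perf code paths.
import argparse


def build_parser() -> argparse.ArgumentParser:
    # --help and argument errors return here, before any heavy import runs.
    parser = argparse.ArgumentParser(prog="genai-perf")
    parser.add_argument("--tokenizer", help="tokenizer name or local path")
    return parser


def main() -> None:
    args = build_parser().parse_args()

    if args.tokenizer:
        # Deferred import: the transformers import cost is paid only when a
        # tokenizer is actually needed.
        from transformers import AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained(args.tokenizer)
        print(type(tokenizer).__name__)


if __name__ == "__main__":
    main()
```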

Loading times (screenshots in the original PR, not reproduced here):

  • genai-perf profile --help, first call: before vs. after
  • genai-perf profile --help, repeated calls: before vs. after
  • genai-perf error, first run: before vs. after

Remove unnecessary import
@@ -58,7 +57,7 @@ def __init__(self, name: str, trust_remote_code: bool, revision: str) -> None:
         self._encode_args = {"add_special_tokens": False}
         self._decode_args = {"skip_special_tokens": True}

-    def __call__(self, text, **kwargs) -> BatchEncoding:
+    def __call__(self, text, **kwargs) -> "BatchEncoding":
@dyastremsky (Contributor Author):
This is a forward reference, as defined in PEP 484. It defers resolution of the annotation until it is needed. The type is only required by MyPy, which can still check it because the import above runs only during static type checking.
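For context, a minimal sketch of that pattern with the class body stubbed out (not the real implementation):

```python
from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    # Evaluated only by static type checkers such as MyPy, never at runtime.
    from transformers import BatchEncoding


class Tokenizer:
    def __call__(self, text: str, **kwargs: Any) -> "BatchEncoding":
        # The quoted annotation is a PEP 484 forward reference, so the name
        # does not have to be importable when this module loads at runtime.
        ...
```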

@@ -50,9 +51,10 @@ def convert(
     ) -> Dict[Any, Any]:
         request_body: Dict[str, Any] = {"data": []}

+        tokenizer = cast(Tokenizer, config.tokenizer)
Contributor:

This casting seems strange. Why isn't config.tokenizer already of type Tokenizer? It looks to me like the genai_perf.tokenizer.Tokenizer class has the wrapper methods needed to be used in place (e.g., __call__, encode, decode). So instead of casting, is config.tokenizer set to the right type/value wherever it's initialized?

Contributor:

It seems more error-prone to have these extra places that use tokenizer instead of self.config.tokenizer, possibly losing a single source of truth across all the places that touch it.

@dyastremsky (Contributor Author), Oct 25, 2024:

Thanks for reviewing! Yes, it's casting it to the genai_perf.tokenizer.Tokenizer class. The cast only affects static type checking (i.e., MyPy); at runtime, typing.cast is a no-op that returns the value unchanged. It's how we deal with situations where a variable is Optional (i.e., it can be the specified type or None).

This is necessary because of this change. If the type is not Optional, then it needs a default value other than None, which would require a call to HuggingFace and fail in an air-gapped environment. If the type is Optional, then static type checkers will complain without the cast.

The two options I see are:

  1. Removing the default value in InputsConfig, since None isn't really a default value. However, parameters without default values need to be listed first, so I'd need to break the consistency of the file where every parameter has a default value and is grouped.
  2. I can cast it in-line, which might reduce legibility but avoids the need for a second variable at runtime. It also maintains a single source of truth.

I'll push a commit with approach 2 (sketched below). If you prefer approach 1 or a different option, let me know.
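A small sketch contrasting the cast-into-a-variable form from the diff above with the in-line cast of approach 2, using stand-in classes rather than the real GenAI-Perf types:

```python
# Stand-ins for illustration; the real Tokenizer and InputsConfig live in
# GenAI-Perf and carry more fields and behavior.
from dataclasses import dataclass
from typing import List, Optional, cast


class Tokenizer:
    def encode(self, text: str) -> List[int]:
        return [ord(ch) for ch in text]


@dataclass
class InputsConfig:
    tokenizer: Optional[Tokenizer] = None


def convert(config: InputsConfig) -> List[int]:
    # Original diff style: cast once into a local variable.
    tokenizer = cast(Tokenizer, config.tokenizer)
    first = tokenizer.encode("hello")

    # Approach 2: cast in-line at the point of use, keeping config.tokenizer
    # as the single source of truth. cast() is a runtime no-op either way;
    # it only informs the static type checker.
    second = cast(Tokenizer, config.tokenizer).encode("hello")

    assert first == second
    return second


print(convert(InputsConfig(tokenizer=Tokenizer())))
```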

Contributor:

I see, thanks for the explanation!

Is the tokenizer truly Optional? Or are you just using Optional as a means of deferring initialization?

If it's the latter, can't we initialize our Tokenizer wrapper class up front but defer the HF tokenizer initialization to whenever it's needed later, within the wrapper class?

I think this could do both (a) not always initialize HF on startup and (b) not introduce complications in the type checking. Let me know what you think.
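One possible shape of that suggestion, sketched with the constructor signature from the diff above; illustrative only, not the implementation that landed:

```python
# Sketch: a wrapper that is cheap to construct and loads the HuggingFace
# tokenizer lazily on first use. Names and structure are assumptions.
from typing import Any, Optional


class Tokenizer:
    def __init__(self, name: str, trust_remote_code: bool = False, revision: str = "main") -> None:
        self._name = name
        self._trust_remote_code = trust_remote_code
        self._revision = revision
        self._hf_tokenizer: Optional[Any] = None  # not loaded yet

    def _hf(self) -> Any:
        if self._hf_tokenizer is None:
            # Heavy import and HuggingFace load deferred until first use.
            from transformers import AutoTokenizer

            self._hf_tokenizer = AutoTokenizer.from_pretrained(
                self._name,
                trust_remote_code=self._trust_remote_code,
                revision=self._revision,
            )
        return self._hf_tokenizer

    def encode(self, text: str, **kwargs: Any) -> Any:
        return self._hf().encode(text, **kwargs)
```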

@dyastremsky (Contributor Author):

It's not truly optional. The HF tokenizer initialization always happens in the code as soon as the class is initialized (it's never empty).

That's a good point. I made it non-optional. To avoid clunkier code (several extra lines) everywhere a tokenizer is initialized for the config, in both the code and the tests, I added a way to initialize the class without initializing the HF tokenizer for testing purposes, then updated all the tests that previously did not supply a tokenizer. Let me know if the changes work for you.
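For illustration, one hypothetical way such a test-friendly initialization could look; get_tokenizer mirrors the helper mentioned later in this thread, but the exact signatures here are assumptions, and the actual change may differ:

```python
# Hypothetical sketch: a Tokenizer that can be constructed empty for tests,
# plus a factory that performs the real HuggingFace load for normal runs.
from typing import Any, Optional


class Tokenizer:
    def __init__(self) -> None:
        # No HuggingFace call here, so tests can build a Tokenizer offline.
        self._hf_tokenizer: Optional[Any] = None

    def set_tokenizer(self, name: str, trust_remote_code: bool, revision: str) -> None:
        from transformers import AutoTokenizer

        self._hf_tokenizer = AutoTokenizer.from_pretrained(
            name, trust_remote_code=trust_remote_code, revision=revision
        )


def get_tokenizer(name: str, trust_remote_code: bool = False, revision: str = "main") -> Tokenizer:
    tokenizer = Tokenizer()
    tokenizer.set_tokenizer(name, trust_remote_code, revision)
    return tokenizer
```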

Contributor:

I didn't mean for this to turn into changing many more lines of code to alter how the tokenizer gets passed in everywhere. I thought it would have a similar impact radius to your original `tokenizer: Optional[Tokenizer] = None` default, changed to something like `tokenizer: Tokenizer = Tokenizer()  # HF AutoTokenizer deferred until later`.

The questions were food for thought on whether the Optional approach made sense.

I'm happy with the current code, or the in-line casting version. I defer to the other tools folks on their preference.

@dyastremsky (Contributor Author):

Unfortunately, doing it that way still made the code clunkier and touched a decent amount of it, because the tokenizer only gets passed in as a parameter when instantiating InputsConfig (i.e., config = InputsConfig(...tokenizer=get_tokenizer(tokenizer_args)). When I tried that approach, each call site still needed a couple of extra lines to create a temporary tokenizer variable and set its value before passing it in. Excluding testing, the current change is probably the most correct way to handle the Tokenizer usages and type checks.

I'll merge this change so that we can support users who need to use GenAI-Perf in airgapped environments. We can always refactor later on if we find or prefer a different approach.

Thanks for reviewing, Ryan! Your feedback was helpful, as always. 👏

@rmccorm4 (Contributor) left a comment.

@dyastremsky merged commit 70aca78 into main on Nov 5, 2024
6 of 7 checks passed
@dyastremsky deleted the dyas-default-tokenizer branch on November 5, 2024 at 15:49