
Simplify TokenizerArgs.__post_init__ with Enum Tokenizer Type #1535


Merged
merged 7 commits into from
Apr 28, 2025

Conversation

zhenyan-zhang-meta
Contributor

Summary:
Simplify `TokenizerArgs.__post_init__` with an enum tokenizer type, since only one of the tokenizer types can be true at a time.

We want to touch as little code outside of `__post_init__` as possible at the moment.

Test Plan:
python torchchat.py generate llama2|llama3|granite-code

Reviewers:
@Jack-Khuu

Subscribers:

Issue:
#1518
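A rough sketch of the idea (the `TokenizerType` member names and the `is_*` helpers below are assumptions for illustration; torchchat's actual fields may differ): the three mutually exclusive booleans collapse into one enum field with `NONE` as the default.

```python
from enum import Enum, auto


class TokenizerType(Enum):
    """Exactly one tokenizer type at a time, replacing three mutually exclusive booleans."""
    NONE = auto()
    TIKTOKEN = auto()
    SENTENCEPIECE = auto()
    HF_TOKENIZER = auto()


class TokenizerArgs:
    def __init__(self):
        # Default state before any tokenizer file has been resolved.
        self.tokenizer_type = TokenizerType.NONE
        self.t = None

    def is_tiktoken(self) -> bool:
        return self.tokenizer_type == TokenizerType.TIKTOKEN

    def is_sentencepiece(self) -> bool:
        return self.tokenizer_type == TokenizerType.SENTENCEPIECE

    def is_hf_tokenizer(self) -> bool:
        return self.tokenizer_type == TokenizerType.HF_TOKENIZER
```

Because the enum can only hold one value, the "exactly one of three flags" invariant is enforced by construction rather than checked with `sum([...]) != 1`.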


pytorch-bot bot commented Apr 25, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1535


⏳ No Failures, 23 Pending

As of commit 03e2019 with merge base 5f8f35d (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 25, 2025
Comment on lines 281 to 282
self.tokenizer_type = TokenizerType.NONE
self.t = None
Contributor Author


Do we really have to set these to none again, since we already set them at the very top?

Contributor


We can actually drop all the logic here after the HF tokenizer check; `tokenizer_type` and `.t` are already set to these values by default.

is_sentencepiece = self.is_sentencepiece()
is_hf_tokenizer = self.is_hf_tokenizer()

if sum([is_tiktoken, is_hf_tokenizer, is_sentencepiece]) != 1:
Contributor


We can replace this by just checking whether the tokenizer enum is `TokenizerType.NONE`.


zhenyanzhang and others added 4 commits April 25, 2025 16:27
Contributor

@Jack-Khuu Jack-Khuu left a comment


Looks good; fix the nits, then go ahead and merge.

self.is_tiktoken = False
self.is_sentencepiece = False
self.is_hf_tokenizer = False
self.t = None
return
Contributor


nit: `return` is not needed

@@ -294,12 +304,13 @@ def validate_model(
if model is None:
return

if sum([self.is_tiktoken, self.is_hf_tokenizer, self.is_sentencepiece]) != 1:
if self.is_tokenizer_none():
Contributor


Let's keep it simple and check for `self.tokenizer_type == TokenizerType.NONE` here.

We can skip the assert/helper.
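A minimal sketch of the suggested check, assuming a `TokenizerType` enum along the lines discussed in this PR (the `validate_model` signature is simplified for illustration and is not torchchat's actual API):

```python
from enum import Enum, auto


class TokenizerType(Enum):
    NONE = auto()
    TIKTOKEN = auto()
    SENTENCEPIECE = auto()
    HF_TOKENIZER = auto()


def validate_model(tokenizer_type: TokenizerType, model) -> None:
    # Nothing to validate without a model.
    if model is None:
        return
    # A single enum comparison replaces sum([...]) != 1 over three booleans.
    if tokenizer_type == TokenizerType.NONE:
        raise RuntimeError("no tokenizer was configured for this model")
```

The direct comparison reads the same as the old boolean-sum check but cannot drift out of sync with the flag bookkeeping, since there is only one field to inspect.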

@zhenyan-zhang-meta zhenyan-zhang-meta merged commit 0299a37 into main Apr 28, 2025
72 checks passed
@zhenyan-zhang-meta zhenyan-zhang-meta deleted the zhenyan-1518 branch April 28, 2025 18:16