
removing tensorflow_text for aarch64 compatibility #883

Draft · wants to merge 1 commit into main
Conversation

@rdyro (Collaborator) commented Sep 12, 2024:

No description provided.

@rdyro marked this pull request as draft on September 12, 2024 00:58
@rdyro force-pushed the rdyro-remove-tftxt branch 3 times, most recently from 646f83b to 08378df, on September 12, 2024 02:11
@rdyro marked this pull request as ready for review on September 12, 2024 04:25
@rdyro (Collaborator, Author) commented Sep 12, 2024:

I'm not sure why the Unit Test / Common test (v4-8) (pull_request) check is failing.

@rdyro force-pushed the rdyro-remove-tftxt branch 2 times, most recently from afd0e23 to 1f7ce19, on September 12, 2024 23:02
@gobbleturk (Collaborator) left a comment:


LGTM, can we also have @aireenmei review?

@aireenmei (Collaborator) left a comment:


nit, LGTM in general

for k in data_keys:
  if isinstance(tokenizer, TikTokenTokenizer):
    features[k] = tf.py_function(_process_string, [features[k]], Tout=[tf.int32])[0]
  elif isinstance(tokenizer, SentencePieceTokenizer):
    features[k] = tokenizer.encode(features[k])
    features[k] = tf.py_function(_process_string, [features[k]], Tout=[tf.int32])[0]
A collaborator commented:

Do we still need an if-elif statement here?

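For context, a minimal sketch of how the branches could collapse if the trailing tf.py_function call is meant to apply to both tokenizer types (a hypothetical rewrite, not code from this PR):

for k in data_keys:
  # Only SentencePiece needs the extra encode step; the trailing
  # tf.py_function call is identical in both branches, so it can be hoisted.
  if isinstance(tokenizer, SentencePieceTokenizer):
    features[k] = tokenizer.encode(features[k])
  features[k] = tf.py_function(_process_string, [features[k]], Tout=[tf.int32])[0]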
A collaborator commented:

Did you check the performance here? Using a py_function for TikToken affected performance, which is why the recommendation is to use pygrain/hf with tiktoken.

@rdyro (Collaborator, Author) commented Sep 13, 2024:

Ok, the performance consideration is something I completely missed. From some of my tests, it's either tf.py_function or SentencePiece from sentencepiece that acquires the GIL and significantly impacts performance.

I looked into it a bit, but for me even calling an empty tf.py_function throttles the loader down a lot (4x-6x).

Do you have any ideas? @khatwanimohit
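For reference, a minimal self-contained harness of the kind that could verify this (hypothetical, not from this PR): it compares tf.data throughput for a plain graph-level map against a map that routes every element through a no-op tf.py_function.

import time
import tensorflow as tf

def _noop(x):
  # Empty Python body: each element still round-trips through the
  # Python interpreter, serialized by the GIL.
  return x

def throughput(ds, n=10_000):
  start = time.perf_counter()
  for _ in ds.take(n):
    pass
  return n / (time.perf_counter() - start)

base = tf.data.Dataset.range(1_000_000)
plain = base.map(lambda x: x + 1, num_parallel_calls=tf.data.AUTOTUNE)
with_pyfn = base.map(
    lambda x: tf.py_function(_noop, [x], Tout=tf.int64),
    num_parallel_calls=tf.data.AUTOTUNE,
)

print("plain map, elems/s:      ", throughput(plain))
print("py_function map, elems/s:", throughput(with_pyfn))

If the empty-py_function pipeline is several times slower, the overhead is in the eager/GIL round-trip itself rather than in the tokenizer.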

@rdyro marked this pull request as draft on September 13, 2024 16:52