
removing tensorflow_text for aarch64 compatibility #883

Draft · wants to merge 4 commits into main from rdyro-remove-tftxt
Conversation

@rdyro (Collaborator) commented Sep 12, 2024

No description provided.

@rdyro rdyro marked this pull request as draft September 12, 2024 00:58
@rdyro rdyro force-pushed the rdyro-remove-tftxt branch 3 times, most recently from 646f83b to 08378df Compare September 12, 2024 02:11
@rdyro rdyro marked this pull request as ready for review September 12, 2024 04:25
@rdyro (Collaborator, Author) commented Sep 12, 2024

I'm not sure why the "Unit Test / Common test (v4-8) (pull_request)" check is failing.

@rdyro rdyro force-pushed the rdyro-remove-tftxt branch 3 times, most recently from 1f7ce19 to bb1666a Compare September 13, 2024 01:20
@gobbleturk (Collaborator) left a comment:

LGTM, can we also have @aireenmei review?

@aireenmei (Collaborator) left a comment:

nit, LGTM in general

MaxText/tokenizer.py (outdated):
    for k in data_keys:
      if isinstance(tokenizer, TikTokenTokenizer):
        features[k] = tf.py_function(_process_string, [features[k]], Tout=[tf.int32])[0]
      elif isinstance(tokenizer, SentencePieceTokenizer):
        features[k] = tokenizer.encode(features[k])
        features[k] = tf.py_function(_process_string, [features[k]], Tout=[tf.int32])[0]
Collaborator left a comment:
Did you check the performance here? Using a py_function for TikToken affected performance, which is why the recommendation is to use pygrain/hf with tiktoken.
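For reference, the pygrain/hf route keeps tiktoken in pure Python and out of the tf.data graph entirely, so no tf.py_function is involved. A rough sketch using Hugging Face datasets (the dataset, column name, and encoding are placeholders, not the MaxText configuration):

    import tiktoken
    from datasets import load_dataset

    enc = tiktoken.get_encoding("cl100k_base")  # placeholder encoding
    ds = load_dataset("allenai/c4", "en", split="train", streaming=True)  # placeholder dataset

    def tokenize(example):
      # Plain-Python map: tokenization happens outside tf.data, so tf.py_function is not needed.
      return {"inputs": enc.encode(example["text"])}

    ds = ds.map(tokenize)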

@rdyro (Collaborator, Author) Sep 13, 2024:

Ok, the performance consideration is something I completely missed. From some of my tests, either tf.py_function or SentencePiece from the sentencepiece package acquires the GIL and significantly impacts performance.

I looked into it a bit, but for me even calling an empty tf.py_function slows the loader down a lot (4x-6x).

Do you have any ideas? @khatwanimohit
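A minimal sketch of how that comparison can be reproduced (synthetic data, illustrative only; absolute numbers will vary by host):

    import time
    import tensorflow as tf

    N = 100_000
    ds = tf.data.Dataset.from_tensor_slices(tf.strings.as_string(tf.range(N)))

    def identity_py(x):
      # "Empty" Python function: does no work, but still crosses the TF <-> Python boundary.
      return x

    native = ds.map(lambda x: x, num_parallel_calls=tf.data.AUTOTUNE)
    py_fn = ds.map(lambda x: tf.py_function(identity_py, [x], Tout=tf.string),
                   num_parallel_calls=tf.data.AUTOTUNE)

    def throughput(d):
      start = time.perf_counter()
      for _ in d.prefetch(tf.data.AUTOTUNE):
        pass
      return N / (time.perf_counter() - start)

    print(f"native map:     {throughput(native):,.0f} elements/s")
    print(f"tf.py_function: {throughput(py_fn):,.0f} elements/s")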

@rdyro rdyro marked this pull request as draft September 13, 2024 16:52