Fix/extend re replacement seq #948

saattrupdan · 2024-06-10T09:15:48Z

This PR is an extension of #763, related to extending the re_replacement_seq regex.

The new NorwAI models use a tokenizer that has the token `�.`, which leads to the same error as was described in the previous issue #762.

This PR extends the fix from #763 to handle this case, adds a unit test covering various tokenizers, and adds a comment explaining why the regex needs the prefix and suffix.
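To illustrate the kind of match this change is after, here is a minimal sketch. It is not the library's actual pattern: the `▁` prefix is an assumption (the description only says the regex has a prefix and suffix), the `\.*` postfix is taken from the first commit title, and the names `RE_REPLACEMENT_SEQ_SKETCH` and `is_replacement_seq` are hypothetical.

```python
import re

# Hypothetical reconstruction, not copied from the library:
#   \u2581*  optional "▁" prefix (an assumption for illustration)
#   \ufffd+  one or more U+FFFD replacement characters
#   \.*      optional "." postfix, as named in the first commit title
RE_REPLACEMENT_SEQ_SKETCH = re.compile(r"^\u2581*\ufffd+\.*$")


def is_replacement_seq(token: str) -> bool:
    """Return True if the token consists only of replacement characters,
    optionally wrapped in the assumed prefix and the "." postfix."""
    return RE_REPLACEMENT_SEQ_SKETCH.match(token) is not None


# The token reported for the NorwAI tokenizer is matched by the sketch.
print(is_replacement_seq("\ufffd."))  # True
# An ordinary token is not.
print(is_replacement_seq("hello"))  # False
```

Listing token strings directly like this also avoids loading gated tokenizers in a test.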

saattrupdan · 2024-06-10T10:22:14Z

@rlouf The tests are failing as we need a Hugging Face token to load the NorwAI tokeniser, which is the reason why this PR is needed in the first place. Would it be possible to have a Hugging Face token as a GitHub secret to deal with this?

rlouf · 2024-06-11T09:19:25Z

HUGGINGFACE_API_TOKEN should now be available in actions

saattrupdan · 2024-06-11T14:22:47Z

> HUGGINGFACE_API_TOKEN should now be available in actions

That's great, thanks. Can you please get access to the following models with that token?

meta-llama/Meta-Llama-3-8B
mistralai/Mistral-7B-v0.3
google/gemma-2b
NorwAI/NorwAI-Mistral-7B-instruct

Those are the culprits of the failing tests.

Also, just to be sure, the HUGGINGFACE_API_TOKEN is an access token and not a token for the Hugging Face inference API, right?

Alternatively, if that's too much of a hassle, I can simply include the failure cases manually, rather than accessing them from "real" tokenisers. Let me know what you think.

rlouf · 2024-06-11T19:32:27Z

It's probably better to do this "manually" indeed.

saattrupdan · 2024-06-12T07:51:08Z

@rlouf Changed the test to a "manual" one now, and all tests pass 🙂

rlouf · 2024-06-12T11:58:17Z

Awesome, thank you!

saattrupdan added 7 commits June 10, 2024 10:42

fix: Add optional "\.*" postfix to re_replacement_seq

9bfa908

tests: Add test_reduced_vocabulary_with_rare_tokens

d97a812

fix: Typo in regex

50cddec

tests: Add more model cases to `test_reduced_vocabulary_with_rare_tokens`

75c18bc

docs: Add comment explaining the re_replacement_seq changes

65eccc1

tests: Remove the olmo tokenizer from test_reduced_vocabulary_with_rare_tokens

baff354

style: Black

ef03abb

saattrupdan mentioned this pull request Jun 10, 2024

[MODEL EVALUATION REQUEST] NorwAI/NorwAI-Mistral-7B-instruct ScandEval/ScandEval#453

Closed

8 tasks

Merge branch 'main' into fix/extend-re-replacement-seq

7b9307e

Merge branch 'main' into fix/extend-re-replacement-seq

3ba75ed

saattrupdan and others added 4 commits June 11, 2024 15:53

Merge branch 'main' into fix/extend-re-replacement-seq

a0c9a3e

tests: Use HUGGINGFACE_API_TOKEN

26e4e3c

Merge branch 'fix/extend-re-replacement-seq' of github.com:saattrupdan/outlines into fix/extend-re-replacement-seq

6b43d76

style: Isort

9f2a746

saattrupdan added 2 commits June 11, 2024 22:30

tests: Change test_reduced_vocabulary_with_rare_tokens to use tokens directly

5f646f3

docs: Docstring change to trigger CI

605afd2

rlouf merged commit a987159 into dottxt-ai:main Jun 12, 2024
7 checks passed

saattrupdan deleted the fix/extend-re-replacement-seq branch June 12, 2024 13:38
