Speed Up Tokenization Through Multiprocessing #347

Merged 1 commit on Jan 13, 2024


@donglihe-hub (Contributor) commented Jan 12, 2024

What does this PR do?

For large datasets, tokenization can take a long time (about 40 minutes for AmazonCat-13K using nltk.word_tokenize). Since instances are independent of each other during tokenization, multiprocessing can speed up the process.

Currently, LibMultiLabel does not print any progress information during tokenization, so users may assume the program is stuck when they provide a large dataset. This PR therefore also adds a tqdm progress bar to tokenization, which lets users see what LibMultiLabel is doing.
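
For readers who want a picture of the idea, here is a minimal sketch of parallel tokenization with a progress bar. It is not the code merged in this PR. The names `tokenize` and `tokenize_corpus`, the regex tokenizer (a stand-in for nltk.word_tokenize), and the `chunksize` default are all assumptions made for illustration.

```python
import re
from multiprocessing import Pool

try:
    from tqdm import tqdm  # progress bar library named in the PR description
except ImportError:
    # Fallback so the sketch still runs without tqdm installed: no progress bar.
    def tqdm(iterable, **kwargs):
        return iterable

_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Hypothetical stand-in for nltk.word_tokenize: words and punctuation marks."""
    return _TOKEN_RE.findall(text)


def tokenize_corpus(texts, processes=None, chunksize=256):
    """Tokenize each text independently, optionally across worker processes.

    processes=None lets Pool use all available CPUs; processes=1 runs serially.
    """
    texts = list(texts)
    if processes == 1:
        return [tokenize(t) for t in tqdm(texts, desc="Tokenizing")]
    with Pool(processes=processes) as pool:
        # imap keeps the input order and yields results as they finish,
        # which lets tqdm update the bar while the workers are running.
        results = pool.imap(tokenize, texts, chunksize=chunksize)
        return list(tqdm(results, total=len(texts), desc="Tokenizing"))
```

On platforms that start workers with spawn (Windows, macOS), `tokenize_corpus` should be called from under an `if __name__ == "__main__":` guard. A larger `chunksize` reduces inter-process overhead when each text is short, at the cost of less frequent progress updates.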

Test CLI & API (bash tests/autotest.sh)

Test APIs used by main.py.

  • Test Pass
    • (Copy and paste the last outputted line here.)
  • Not Applicable (i.e., the PR does not include API changes.)

Check API Document

If any new APIs are added, please check if the description of the APIs is added to API document.

  • API document is updated (linear, nn)
  • Not Applicable (i.e., the PR does not include API changes.)

Test quickstart & API (bash tests/docs/test_changed_document.sh)

If any APIs in quickstarts or tutorials are modified, please run this test to check if the current examples can run correctly after the modified APIs are released.

@Gordon119 (Collaborator) commented

Looks good to me. How about @Eleven1Liu?

@Eleven1Liu (Collaborator) commented

Good

@Gordon119 Gordon119 merged commit 17c022e into ASUS-AICS:master Jan 13, 2024
1 check passed
@donglihe-hub donglihe-hub deleted the concurent branch January 23, 2024 08:50
3 participants