Skip to content

Truncated Printing of the Tokenization Results for issue #90 #121

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Apr 15, 2025

Conversation

emmanuel-stone
Copy link
Contributor

@emmanuel-stone emmanuel-stone commented Mar 15, 2025

Hi! Attached please find the possible fix for issue #90

Thank you for considering my submission, and I truly appreciate it.

Summary

Added __repr__ method for Tokenized class at bm25s/tokenization.py ,

and added CorpusIDsList class that inherits the built in class of python list,

in order to make the printing of the tokenization results of the large corpus as follows.

return_ids=True

Tokenized(
  "ids": [
    0: [0, 1, 2, 3]
    1: [4, 5, 6, 7, 8, 9]
    2: [10, 11, 12, 13, 14]
    3: [15, 16, 17, 18, 19]
    4: [0, 1, 2, 3, 0, 20, 21, 22, 23, 24, ...]
    5: [0, 1, 2, 3]
    6: [4, 5, 6, 7, 8, 9]
    7: [10, 11, 12, 13, 14]
    8: [15, 16, 17, 18, 19]
    9: [0, 1, 2, 3, 0, 20, 21, 22, 23, 24, ...]
    ... (total 500000 docs)
  ],
  "vocab": [
: 29
    'animal': 12
    'beautiful': 11
    'best': 6
    'bird': 10
    'can': 13
    'carefully': 27
    'casually': 28
    'cat': 0
    'creature': 16
    ... (total 30 tokens)
  ],
)

return_ids=False

CorpusIDsList(
  0: [cat, feline, likes, purr]
  1: [dog, human, best, friend, loves, play]
  2: [bird, beautiful, animal, can, fly]
  3: [fish, creature, lives, water, swims]
  4: [cat, feline, likes, purr, cat, may, like, jump, very, high, ...]
  5: [cat, feline, likes, purr]
  6: [dog, human, best, friend, loves, play]
  7: [bird, beautiful, animal, can, fly]
  8: [fish, creature, lives, water, swims]
  9: [cat, feline, likes, purr, cat, may, like, jump, very, high, ...]
  ... (total 500000 docs)
)

Details

bm25s/tokenization.py

return_ids=True

At Tokenized class, added __repr__ that can print the pretty output. It was inspired from the examples of pandas’ DataFrame or HuggingFace’s truncated output of tensors. The package pandas uses some formatting helpers to set each of the lines. I tried to implement such methods, but for current purpose, the helper module was deemed possibly excessive; hence the current version. I reckoned that I would use tab instead of the spaces, but thought that spaces would be better for uniform representation over diverse environments.

return_ids=False

Added a new class CorpusIDsList that extends Python's native list, so that the returned types would be consistent. The name was chosen, following the original return values, for the coherence of the reading.

tests/core/test_tokenizer_misc.py

Added tests (test_truncation_of_large_corpus and test_truncation_of_small_corpus), in order to check the marker of truncation , ...] or ... (total.

Some more details

The reproduction code that I have used for the issue was as follows, using the examples in the original readme.

import bm25s
import Stemmer

corpus_very_small = [
    "a cat is a feline and likes to purr",
]

corpus_small = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
    # a line more than 10 tokens
    "a cat is a feline and likes to purr and the cat may like to jump very high still so carefully and casually",
]
stemmer = Stemmer.Stemmer("english")
corpus_large = []
for i in range(100000):
    corpus_large += corpus_small

type2corpus = {
    "VERY_SMALL": corpus_very_small,
    "SMALL": corpus_small,
    "LARGE": corpus_large,
}

for corpus_type, target_corpus in type2corpus.items():
    print("*"*30)
    print("-"*30)
    print(f"[CASE] CORPUS={corpus_type} | STEMMER=NONE | RETURN_IDS=TRUE")
    corpus_tokens = bm25s.tokenize(target_corpus, stopwords="en", stemmer=None, return_ids=True)
    retriever = bm25s.BM25()
    retriever.index(corpus_tokens)
    print(corpus_tokens)

    print("-"*30)
    print(f"[CASE] CORPUS={corpus_type} | STEMMER=NONE | RETURN_IDS=FALSE")
    corpus_tokens = bm25s.tokenize(target_corpus, stopwords="en", stemmer=None, return_ids=False)
    retriever = bm25s.BM25()
    retriever.index(corpus_tokens)
    print(corpus_tokens)

    stemmer = Stemmer.Stemmer("english")
    print("-"*30)
    print(f"[CASE] CORPUS={corpus_type} | STEMMER=NAIVE | RETURN_IDS=TRUE")
    corpus_tokens = bm25s.tokenize(target_corpus, stopwords="en", stemmer=stemmer, return_ids=True)
    retriever = bm25s.BM25()
    retriever.index(corpus_tokens)
    print(corpus_tokens)

    print("-"*30)
    print(f"[CASE] CORPUS={corpus_type} | STEMMER=NAIVE | RETURN_IDS=FALSE")
    corpus_tokens = bm25s.tokenize(target_corpus, stopwords="en", stemmer=stemmer, return_ids=False)
    retriever = bm25s.BM25()
    retriever.index(corpus_tokens)
    print(corpus_tokens)

I have ran the local tests at tests/core, to confirm the added lines would not conflict with the original code base. If there are further improvements or alternative approaches you’d prefer, please let me know; I’m happy to edit or modify this further following your feedback, and I am looking forward to it.

@xhluca
Copy link
Owner

xhluca commented Mar 17, 2025

Hi I'm quite busy with a submission so won't have a chance to review this for the next 2 weeks. Will review after that!

@emmanuel-stone
Copy link
Contributor Author

emmanuel-stone commented Mar 18, 2025

Thank you very much for the heads up! Please take time! I hope to have a chance to learn from your feedback, so looking forward to it :) Thank you again and hope that you have all the best luck for your submission.

Copy link
Owner

@xhluca xhluca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove CorpusIDsList? I do not believe it adds value. the issue #90 only applies to printing, rather than the actual returned object. There's no benefit to create a custom list object inherited from list as far as I can see in this scenario, but adds more complexity to the code. I prefer occam's razor here.

@@ -634,5 +748,4 @@ def tokenize(
)
):
corpus_ids[i] = [reverse_dict[token_id] for token_id in token_ids]

return corpus_ids
return CorpusIDsList(corpus_ids)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a good idea, because it changes the return signature and would thus create a breaking change. I also do not see the benefit of abstracting the returned object with another class. Can you revert it back to the original returned object and remove CorpusIDsList?

7: [10, 11, 12, 13, 14]
8: [15, 16, 17, 18, 19]
9: [0, 1, 2, 3, 0, 20, 21, 22, 23, 24, ...]
... (total 500000 docs)
Copy link
Owner

@xhluca xhluca Apr 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good idea: ... (total 500000 docs)

@xhluca xhluca requested a review from Copilot April 8, 2025 04:42
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

bm25s/tokenization.py:110

  • [nitpick] Consider using the variable 'lines_print_max_num' instead of the hard-coded value '10' to maintain consistency in the truncation logic for displaying vocab items.
if len(list(vocab_keys)) > 10:

corpus_ids = bm25s.tokenize(corpus, stopwords="en", return_ids=False)
repr_corpus_ids = repr(corpus_ids)
self.assertEqual(repr_corpus_ids, str(repr_corpus_ids))
Copy link
Preview

Copilot AI Apr 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In test_truncation_of_large_corpus, the assertion compares repr_corpus_ids to its own string conversion, which is always true. It should compare repr_corpus_ids to str(corpus_ids) to correctly validate the repr output.

Suggested change
self.assertEqual(repr_corpus_ids, str(repr_corpus_ids))
self.assertEqual(repr_corpus_ids, str(corpus_ids))

Copilot is powered by AI, so mistakes are possible. Review output carefully before use.

corpus_ids = bm25s.tokenize(corpus, stopwords="en", return_ids=False)
repr_corpus_ids = repr(corpus_ids)
self.assertEqual(repr_corpus_ids, str(repr_corpus_ids))
Copy link
Preview

Copilot AI Apr 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In test_truncation_of_small_corpus, the assertion mistakenly compares repr_corpus_ids to its own string conversion instead of comparing it to str(corpus_ids). Adjusting this check will better verify the intended repr behavior.

Suggested change
self.assertEqual(repr_corpus_ids, str(repr_corpus_ids))
self.assertEqual(repr_corpus_ids, str(corpus_ids))

Copilot is powered by AI, so mistakes are possible. Review output carefully before use.

@emmanuel-stone
Copy link
Contributor Author

thank you for such a kind and helpful advice, I really appreciate it. I will try to read through the suggestion and try to follow the advice and the guidance. Please bear with me because I am rather a novice at OSS. Again, thank you very much!

@xhluca
Copy link
Owner

xhluca commented Apr 8, 2025

No worries, take your time!

@emmanuel-stone
Copy link
Contributor Author

emmanuel-stone commented Apr 12, 2025

Thank you again, for taking the time, to review the code and to provide your valuable feedback. I’ve pushed two new commits to address your feedback on this pull request.

Specifically:

  1. Removed CorpusIDsList and reverted the code to return the original list object, as you recommended.
  2. Updated the tests to follow Copilot’s suggestion for comparing repr_corpus_ids with str(corpus_ids)

The added new commits are:

If there’s anything else you’d like me to refine or edit or clarify, please let me know. Thank you once again for your thoughtful feedback and support; I really appreciate it.

Looking forward to your thoughts,
with kind and warm regards, emmanuel

Copy link
Owner

@xhluca xhluca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@xhluca xhluca merged commit e7690fa into xhluca:main Apr 15, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants