Truncated Printing of the Tokenization Results for issue #90 #121

emmanuel-stone · 2025-03-15T10:39:46Z

Hi! Please find attached a possible fix for issue #90.

Thank you for considering my submission; I truly appreciate it.

Summary

Added a __repr__ method to the Tokenized class in bm25s/tokenization.py, and added a CorpusIDsList class that inherits from Python's built-in list, so that the tokenization results of a large corpus print as follows.

`return_ids=True`

Tokenized(
  "ids": [
    0: [0, 1, 2, 3]
    1: [4, 5, 6, 7, 8, 9]
    2: [10, 11, 12, 13, 14]
    3: [15, 16, 17, 18, 19]
    4: [0, 1, 2, 3, 0, 20, 21, 22, 23, 24, ...]
    5: [0, 1, 2, 3]
    6: [4, 5, 6, 7, 8, 9]
    7: [10, 11, 12, 13, 14]
    8: [15, 16, 17, 18, 19]
    9: [0, 1, 2, 3, 0, 20, 21, 22, 23, 24, ...]
    ... (total 500000 docs)
  ],
  "vocab": [
    '': 29
    'animal': 12
    'beautiful': 11
    'best': 6
    'bird': 10
    'can': 13
    'carefully': 27
    'casually': 28
    'cat': 0
    'creature': 16
    ... (total 30 tokens)
  ],
)

`return_ids=False`

CorpusIDsList(
  0: [cat, feline, likes, purr]
  1: [dog, human, best, friend, loves, play]
  2: [bird, beautiful, animal, can, fly]
  3: [fish, creature, lives, water, swims]
  4: [cat, feline, likes, purr, cat, may, like, jump, very, high, ...]
  5: [cat, feline, likes, purr]
  6: [dog, human, best, friend, loves, play]
  7: [bird, beautiful, animal, can, fly]
  8: [fish, creature, lives, water, swims]
  9: [cat, feline, likes, purr, cat, may, like, jump, very, high, ...]
  ... (total 500000 docs)
)

Details

`bm25s/tokenization.py`

`return_ids=True`

In the Tokenized class, I added a __repr__ that prints a compact, readable summary. It was inspired by the truncated output of pandas' DataFrame and HuggingFace's truncated tensor output. pandas uses dedicated formatting helpers to build each line. I tried implementing similar helpers, but a separate helper module seemed excessive for the current purpose, hence the current version. I considered indenting with tabs, but chose spaces so that the output looks the same across environments.

`return_ids=False`

Added a new class, CorpusIDsList, that extends Python's native list so that the returned type stays consistent with the original. The name follows the original return value, for readability.

`tests/core/test_tokenizer_misc.py`

Added tests (test_truncation_of_large_corpus and test_truncation_of_small_corpus) that check for the truncation markers `, ...]` and `... (total`.
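A marker check of this kind might look like the sketch below. It does not reproduce the tests in this pull request: `truncated_repr` is a hypothetical helper standing in for the real repr, so that the example runs without bm25s installed.

```python
import unittest


def truncated_repr(docs, max_docs=10, max_tokens=10):
    """Hypothetical helper that renders documents with truncation markers."""
    lines = []
    for i, doc in enumerate(docs[:max_docs]):
        shown = ", ".join(map(str, doc[:max_tokens]))
        suffix = ", ..." if len(doc) > max_tokens else ""
        lines.append(f"  {i}: [{shown}{suffix}]")
    if len(docs) > max_docs:
        lines.append(f"  ... (total {len(docs)} docs)")
    return "\n".join(lines)


class TruncationMarkerTest(unittest.TestCase):
    def test_large_corpus_is_truncated(self):
        docs = [list(range(15))] * 1000
        text = truncated_repr(docs)
        # Both the per-row marker and the corpus-level marker must appear.
        self.assertIn(", ...]", text)
        self.assertIn("... (total 1000 docs)", text)

    def test_small_corpus_is_not_truncated(self):
        docs = [[0, 1, 2], [3, 4]]
        text = truncated_repr(docs)
        # A corpus under both limits should print in full.
        self.assertNotIn(", ...]", text)
        self.assertNotIn("... (total", text)
```

Checking for the absence of the markers in the small case matters as much as checking for their presence in the large case; otherwise a repr that always appends them would still pass.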

Some more details

The reproduction code that I have used for the issue was as follows, using the examples in the original readme.

import bm25s
import Stemmer

corpus_very_small = [
    "a cat is a feline and likes to purr",
]

corpus_small = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
    # a document with more than 10 tokens after stopword removal
    "a cat is a feline and likes to purr and the cat may like to jump very high still so carefully and casually",
]
stemmer = Stemmer.Stemmer("english")
# repeat the small corpus 100,000 times (500,000 documents in total)
corpus_large = corpus_small * 100000

type2corpus = {
    "VERY_SMALL": corpus_very_small,
    "SMALL": corpus_small,
    "LARGE": corpus_large,
}

for corpus_type, target_corpus in type2corpus.items():
    print("*"*30)
    print("-"*30)
    print(f"[CASE] CORPUS={corpus_type} | STEMMER=NONE | RETURN_IDS=TRUE")
    corpus_tokens = bm25s.tokenize(target_corpus, stopwords="en", stemmer=None, return_ids=True)
    retriever = bm25s.BM25()
    retriever.index(corpus_tokens)
    print(corpus_tokens)

    print("-"*30)
    print(f"[CASE] CORPUS={corpus_type} | STEMMER=NONE | RETURN_IDS=FALSE")
    corpus_tokens = bm25s.tokenize(target_corpus, stopwords="en", stemmer=None, return_ids=False)
    retriever = bm25s.BM25()
    retriever.index(corpus_tokens)
    print(corpus_tokens)

    # the stemmer created before the loop is reused here
    print("-"*30)
    print(f"[CASE] CORPUS={corpus_type} | STEMMER=NAIVE | RETURN_IDS=TRUE")
    corpus_tokens = bm25s.tokenize(target_corpus, stopwords="en", stemmer=stemmer, return_ids=True)
    retriever = bm25s.BM25()
    retriever.index(corpus_tokens)
    print(corpus_tokens)

    print("-"*30)
    print(f"[CASE] CORPUS={corpus_type} | STEMMER=NAIVE | RETURN_IDS=FALSE")
    corpus_tokens = bm25s.tokenize(target_corpus, stopwords="en", stemmer=stemmer, return_ids=False)
    retriever = bm25s.BM25()
    retriever.index(corpus_tokens)
    print(corpus_tokens)

I have run the local tests in tests/core to confirm that the added lines do not conflict with the existing code base. If there are further improvements or alternative approaches you’d prefer, please let me know; I’m happy to revise this further based on your feedback, and I look forward to it.

xhluca · 2025-03-17T20:07:38Z

Hi, I'm quite busy with a submission, so I won't have a chance to review this for the next 2 weeks. Will review after that!

emmanuel-stone · 2025-03-18T01:39:57Z

Thank you very much for the heads-up! Please take your time! I hope to have a chance to learn from your feedback, so I'm looking forward to it :) Thank you again, and best of luck with your submission.

xhluca

Can you remove CorpusIDsList? I do not believe it adds value. Issue #90 only applies to printing, not to the actual returned object. As far as I can see, there's no benefit to creating a custom object inherited from list in this scenario, and it adds complexity to the code. I prefer Occam's razor here.

xhluca · 2025-04-08T04:37:59Z

bm25s/tokenization.py

@@ -634,5 +748,4 @@ def tokenize(
            )
        ):
            corpus_ids[i] = [reverse_dict[token_id] for token_id in token_ids]
-
-        return corpus_ids
+        return CorpusIDsList(corpus_ids)


This is not a good idea, because it changes the return signature and would thus create a breaking change. I also do not see the benefit of abstracting the returned object with another class. Can you revert it back to the original returned object and remove CorpusIDsList?
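The concern about the return signature can be illustrated with a small sketch. `CorpusListSketch` is a hypothetical class written for this example, not the CorpusIDsList from the diff; it only assumes a list subclass that overrides `__repr__`.

```python
class CorpusListSketch(list):
    """Hypothetical list subclass with a shortened repr (illustration only)."""

    def __repr__(self):
        return f"CorpusListSketch(<{len(self)} docs>)"


plain = [["cat", "feline"], ["dog", "human"]]
wrapped = CorpusListSketch(plain)

# The subclass still behaves like a list for most purposes...
print(wrapped == plain)           # True
print(isinstance(wrapped, list))  # True

# ...but exact-type checks and any text derived from repr() change.
print(type(wrapped) is list)         # False
print(repr(wrapped) == repr(plain))  # False
```

Downstream code that logs the result, compares its repr, or checks `type(x) is list` would see different behavior, which is why returning the plain list keeps the change limited to printing.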

xhluca · 2025-04-08T04:38:12Z

bm25s/tokenization.py

+                7: [10, 11, 12, 13, 14]
+                8: [15, 16, 17, 18, 19]
+                9: [0, 1, 2, 3, 0, 20, 21, 22, 23, 24, ...]
+                ... (total 500000 docs)


This is a good idea: ... (total 500000 docs)

Copilot

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

bm25s/tokenization.py:110

[nitpick] Consider using the variable 'lines_print_max_num' instead of the hard-coded value '10' to maintain consistency in the truncation logic for displaying vocab items.

if len(list(vocab_keys)) > 10:

Copilot · 2025-04-08T04:43:45Z

tests/core/test_tokenizer_misc.py

+        corpus_ids = bm25s.tokenize(corpus, stopwords="en", return_ids=False)
+        repr_corpus_ids = repr(corpus_ids)
+        self.assertEqual(repr_corpus_ids, str(repr_corpus_ids))


In test_truncation_of_large_corpus, the assertion compares repr_corpus_ids to its own string conversion, which is always true. It should compare repr_corpus_ids to str(corpus_ids) to correctly validate the repr output.

Suggested change

-        self.assertEqual(repr_corpus_ids, str(repr_corpus_ids))
+        self.assertEqual(repr_corpus_ids, str(corpus_ids))
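To see why the original assertion cannot fail, consider the sketch below. `Noisy` is a hypothetical class whose `__repr__` and `__str__` deliberately differ; it is not part of bm25s.

```python
class Noisy:
    """Hypothetical class whose __repr__ and __str__ deliberately differ."""

    def __repr__(self):
        return "Noisy(repr)"

    def __str__(self):
        return "Noisy(str)"


obj = Noisy()
r = repr(obj)

# str() of a str returns the same text, so this holds for any object:
assert r == str(r)

# Comparing against str(obj) actually calls the object's __str__,
# and here it exposes the mismatch:
print(r == str(obj))  # False
```

For a class that defines only `__repr__`, `str(obj)` falls back to `__repr__`, so the suggested assertion passes exactly when no separate `__str__` changes the printed form.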

Copilot · 2025-04-08T04:43:45Z

tests/core/test_tokenizer_misc.py

+        corpus_ids = bm25s.tokenize(corpus, stopwords="en", return_ids=False)
+        repr_corpus_ids = repr(corpus_ids)
+        self.assertEqual(repr_corpus_ids, str(repr_corpus_ids))


In test_truncation_of_small_corpus, the assertion mistakenly compares repr_corpus_ids to its own string conversion instead of comparing it to str(corpus_ids). Adjusting this check will better verify the intended repr behavior.

Suggested change

-        self.assertEqual(repr_corpus_ids, str(repr_corpus_ids))
+        self.assertEqual(repr_corpus_ids, str(corpus_ids))

emmanuel-stone · 2025-04-08T07:45:43Z

Thank you for such kind and helpful advice; I really appreciate it. I will read through the suggestions and follow the advice and guidance. Please bear with me, because I am rather a novice at OSS. Again, thank you very much!

xhluca · 2025-04-08T14:55:19Z

No worries, take your time!

emmanuel-stone · 2025-04-12T23:41:29Z

Thank you again for taking the time to review the code and provide your valuable feedback. I’ve pushed two new commits to address your feedback on this pull request.

Specifically:

Removed CorpusIDsList and reverted the code to return the original list object, as you recommended.
Updated the tests to follow Copilot’s suggestion for comparing repr_corpus_ids with str(corpus_ids)

The added new commits are:

If there’s anything else you’d like me to refine or edit or clarify, please let me know. Thank you once again for your thoughtful feedback and support; I really appreciate it.

Looking forward to your thoughts,
with kind and warm regards, emmanuel

xhluca

LGTM!

Add __repr__ and customized list cls; to avoid accidentally huge output

e073d13

xhluca requested changes Apr 8, 2025

xhluca requested a review from Copilot April 8, 2025 04:42

Copilot AI reviewed Apr 8, 2025

emmanuel-stone added 2 commits April 12, 2025 23:34

Merge branch 'xhluca:main' into fix/truncated-printing-tokenized

889d80d

delete CorpusIDsList, revert the usage of (and the testing for) it

9c2a9ba

xhluca approved these changes Apr 15, 2025

xhluca merged commit e7690fa into xhluca:main Apr 15, 2025
2 checks passed

xhluca mentioned this pull request Apr 15, 2025

truncated printing by default #90

Closed
