initialization with empty corpus results in ZeroDivisionError #36

Open
mattf opened this issue Dec 18, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@mattf

mattf commented Dec 18, 2023

rank-bm25==0.2.2

In [11]: rank_bm25.BM25(corpus=[])
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
Cell In[11], line 1
----> 1 rank_bm25.BM25(corpus=[])

File .../lib/python3.11/site-packages/rank_bm25.py:27, in BM25.__init__(self, corpus, tokenizer)
     24 if tokenizer:
     25     corpus = self._tokenize_corpus(corpus)
---> 27 nd = self._initialize(corpus)
     28 self._calc_idf(nd)

File .../lib/python3.11/site-packages/rank_bm25.py:52, in BM25._initialize(self, corpus)
     48             nd[word] = 1
     50     self.corpus_size += 1
---> 52 self.avgdl = num_doc / self.corpus_size
     53 return nd

ZeroDivisionError: division by zero
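
The traceback points at the averaging step: with no documents, `corpus_size` is still 0 when the division runs. A minimal sketch of that step, assuming a simplified stand-in (`average_doc_length` is a hypothetical name, not the library's code; only the final division mirrors line 52 of the traceback):

```python
def average_doc_length(corpus):
    """Simplified stand-in for the averaging step shown in the traceback.

    `corpus` is assumed to be a list of tokenized documents (lists of str).
    """
    corpus_size = 0
    num_doc = 0  # total number of tokens seen across all documents

    for document in corpus:
        num_doc += len(document)
        corpus_size += 1

    # With an empty corpus the loop never runs, so this divides 0 by 0.
    return num_doc / corpus_size


if __name__ == "__main__":
    print(average_doc_length([["hello", "world"], ["bm25"]]))
    try:
        average_doc_length([])
    except ZeroDivisionError as exc:
        print(f"empty corpus fails with: {exc!r}")
```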
@dorianbrown
Owner

Why do you want to initialize with an empty corpus? An error seems like a good thing to do here in my opinion, but I'm curious to hear about your use case.

@dorianbrown dorianbrown added the enhancement New feature or request label Oct 8, 2024
@mattf
Author

mattf commented Oct 8, 2024

iirc, I was developing a RAG pipeline that used hybrid search. rank-bm25 was buried under a few frameworks (app -> framework+ -> rank-bm25), which made the error difficult to interpret and patch.
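
For a pipeline like this, one possible caller-side workaround is to skip building the sparse index when there are no documents yet. This is only a sketch: `build_sparse_index` and `make_index` are hypothetical names, and the decision to return `None` for an empty corpus is an assumption about what the surrounding application wants.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def build_sparse_index(
    tokenized_docs: Sequence[Sequence[str]],
    make_index: Callable[[Sequence[Sequence[str]]], T],
) -> Optional[T]:
    """Build an index only when there is at least one document.

    `make_index` is whatever constructor the pipeline uses; with rank-bm25
    installed it could be, for example, `rank_bm25.BM25Okapi`.
    """
    if not tokenized_docs:
        # Assumption: callers treat None as "no sparse results available".
        return None
    return make_index(tokenized_docs)
```

The caller then has to handle the `None` case explicitly, for example by falling back to dense retrieval only.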

@dorianbrown
Owner

In this case I think it might be best for me to implement a custom exception (e.g. `EmptyCorpusError`), so it would be easier to catch with a try/except. Then your stack can decide what to do with that exception. Does that sound useful?

Within the scope of this project I don't see any functional use cases for empty corpuses currently.
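
A rough sketch of what such an exception could look like. Only the name `EmptyCorpusError` comes from this thread; subclassing `ValueError`, the message text, and the helpers `check_corpus` and `safe_build` are assumptions for illustration, not the contents of any actual change.

```python
class EmptyCorpusError(ValueError):
    """Hypothetical exception for a corpus that contains no documents."""


def check_corpus(corpus):
    """Raise a descriptive error instead of failing later with a division by zero."""
    if len(corpus) == 0:
        raise EmptyCorpusError("corpus must contain at least one document")
    return corpus


def safe_build(corpus):
    """Example of a calling stack handling the error explicitly."""
    try:
        return check_corpus(corpus)
    except EmptyCorpusError:
        # Assumption: the caller prefers "no index" over a crash.
        return None
```

Because the hypothetical class derives from `ValueError`, code that already catches `ValueError` would also catch it, while frameworks further up the stack can match on the more specific name.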

@mattf
Author

mattf commented Oct 8, 2024

that sounds like a great solution.

@dorianbrown
Owner

If you want, you can take a look at this PR with the change: #45

It's pretty small, but it's probably good to have an extra pair of eyes take a look at it.
