Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Same corpus, the other score is 0 #43

Open
rcy1122 opened this issue Jun 11, 2024 · 3 comments
Open

Same corpus, the other score is 0 #43

rcy1122 opened this issue Jun 11, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@rcy1122
Copy link

rcy1122 commented Jun 11, 2024

from rank_bm25 import BM25Okapi

corpus = [
    "Hello there good man!",
    "It is quite windy in London"
    "How is the weather today?"
]

tokenized_corpus = [doc.split(" ") for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)

query = "windy London"
tokenized_query = query.split(" ")

doc_scores = bm25.get_scores(tokenized_query)

print(doc_scores)

[0. 0.93729472 0. ]

But

from rank_bm25 import BM25Okapi

corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    # "How is the weather today?"
]

tokenized_corpus = [doc.split(" ") for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)

query = "windy London"
tokenized_query = query.split(" ")

doc_scores = bm25.get_scores(tokenized_query)

print(doc_scores)

[0. 0.]

The difference lies in the number of corpus elements. It should be incorrect, but I don't know why?

@rcy1122
Copy link
Author

rcy1122 commented Jun 11, 2024

    def _calc_idf(self, nd):
        """
        Calculates frequencies of terms in documents and in corpus.
        This algorithm sets a floor on the idf values to eps * average_idf
        """
        # collect idf sum to calculate an average idf for epsilon value
        idf_sum = 0
        # collect words with negative idf to set them a special epsilon value.
        # idf can be negative if word is contained in more than half of documents
        negative_idfs = []
        for word, freq in nd.items():
            --> idf = math.log(self.corpus_size - freq + 0.5) - math.log(freq + 0.5)
            self.idf[word] = idf
            idf_sum += idf
            if idf < 0:
                negative_idfs.append(word)
        self.average_idf = idf_sum / len(self.idf)

        eps = self.epsilon * self.average_idf
        for word in negative_idfs:
            self.idf[word] = eps
Here
idf = math.log(self.corpus_size - freq + 0.5) - math.log(freq + 0.5)

If self.corpus_size = 2, freq=1, then idf will be equal to 0.

When using get_scores to calculate scores:

    def get_scores(self, query):
        """
        The ATIRE BM25 variant uses an idf function which uses a log(idf) score. To prevent negative idf scores,
        this algorithm also adds a floor to the idf value of epsilon.
        See [Trotman, A., X. Jia, M. Crane, Towards an Efficient and Effective Search Engine] for more info
        :param query:
        :return:
        """
        score = np.zeros(self.corpus_size)
        doc_len = np.array(self.doc_len)
        for q in query:
            q_freq = np.array([(doc.get(q) or 0) for doc in self.doc_freqs])
            print("self.idf.get(q) or 0 = ", self.idf.get(q) or 0)

           --> score += (self.idf.get(q) or 0) * (q_freq * (self.k1 + 1) /
                                               (q_freq + self.k1 * (1 - self.b + self.b * doc_len / self.avgdl)))
        return score

==> When calculating idf (self.idf.get(q) or 0) all will be equal to 0.

Is this correct?

@dorianbrown dorianbrown added the bug Something isn't working label Oct 8, 2024
@jameswu1991
Copy link

jameswu1991 commented Oct 31, 2024

it seems like the trigger condition is when freq * 2 = self.corpus_size. we're missing the +1 term in calculating idf, which would also prevent negative idf values

there seems to be an attempt to fix this missing +1 term here: #40
but the proposed code change mysteriously removes the freq term from calculation

shouldn't

- idf = math.log(self.corpus_size - freq + 0.5) - math.log(freq + 0.5)
+ idf = math.log((self.corpus_size - freq + 0.5) / freq + 0.5) + 1)

be the right expression for calcuating idf?

@rcy1122
Copy link
Author

rcy1122 commented Nov 4, 2024

@jameswu1991 Hi, Do you want to edit this issue? #40 has not been merged yet.
I don't understand bm25 very well, but is it possible to calculate it using :
idf = math.log((self.corpus_size + 0.5) / (freq + 0.5)) ?

Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants