total_terms_in_collection != sum of doclengths in robust04 queries only #21

Open
cmacdonald opened this issue Mar 17, 2020 · 8 comments

@cmacdonald

For Robust04 queries only, the sum of the doclens is 167686911, while total_terms_in_collection=174540872 in the CIFF file. Why is it more? This affects the avgdoclength, and hence the BM25 scores.
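
To make the concern concrete, here is a minimal sketch of how the two totals lead to different BM25 scores for the same document. It uses a textbook BM25 variant, not necessarily the exact formula any particular system implements. The document count, `k1`, `b`, and the per-document numbers are hypothetical values chosen only for illustration; only the two totals come from this issue.

```python
import math

# Totals reported in this issue.
SUM_OF_DOCLENS = 167_686_911
TOTAL_TERMS_IN_COLLECTION = 174_540_872

# Hypothetical values, chosen only for illustration.
NUM_DOCS = 500_000
K1, B = 0.9, 0.4


def bm25_term_score(tf, doc_len, avg_doc_len, doc_freq,
                    num_docs=NUM_DOCS, k1=K1, b=B):
    """Score one query term in one document with a textbook BM25 variant."""
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + length_norm)


# The same collection yields two different average document lengths.
avgdl_lossy = SUM_OF_DOCLENS / NUM_DOCS
avgdl_exact = TOTAL_TERMS_IN_COLLECTION / NUM_DOCS

# Same (hypothetical) document and term, scored with each average.
score_lossy = bm25_term_score(tf=3, doc_len=400, avg_doc_len=avgdl_lossy, doc_freq=1_000)
score_exact = bm25_term_score(tf=3, doc_len=400, avg_doc_len=avgdl_exact, doc_freq=1_000)

print(f"avgdl ratio (exact / lossy): {avgdl_exact / avgdl_lossy:.4f}")
print(f"score with lossy avgdl: {score_lossy:.6f}")
print(f"score with exact avgdl: {score_exact:.6f}")
```

The two totals differ by roughly 4%, so the average document length shifts by the same factor regardless of the document count, and every length-normalized score shifts with it.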

@lintool

lintool commented Mar 17, 2020

I've confirmed this. This could be due to Lucene's doclength approximation? Need to dig deeper into this though.

@lintool

lintool commented Mar 17, 2020

I wrote a simple program to probe into this, and this is indeed the case:

Total number of terms in collection (sum of doclengths):
Lossy: 167686911
Exact: 174540872

The IndexReader reports the exact value, but the sum of doclengths is based on Lucene's lossy per-document values.

So, not a bug, just requires better documentation. I will add documentation as appropriate.
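
As a rough illustration of why summing lossy lengths comes out lower than the exact total, here is a minimal sketch of a length quantizer that keeps only the leading bits of each value and rounds down. This is a simplified, hypothetical scheme for illustration, not Lucene's actual encoding; the function name and the sample lengths are made up.

```python
def quantize_length(length: int, kept_bits: int = 4) -> int:
    """Keep only the `kept_bits` most significant bits of `length`, rounding down.

    Illustrative only: a stand-in for any compact, lossy length encoding.
    """
    if length < 0:
        raise ValueError("length must be non-negative")
    shift = max(length.bit_length() - kept_bits, 0)
    return (length >> shift) << shift


# Made-up document lengths.
exact_lengths = [7, 130, 517, 1023, 2500]
lossy_lengths = [quantize_length(n) for n in exact_lengths]

print("exact:", exact_lengths, "sum =", sum(exact_lengths))
print("lossy:", lossy_lengths, "sum =", sum(lossy_lengths))
```

Because each value is rounded down, the lossy sum can never exceed the exact sum, which matches the direction of the discrepancy reported here.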

lintool self-assigned this Mar 17, 2020

@cmacdonald

If we are asking people to use a total number of tokens, shouldn't it be accurate? The doclengths in the posting lists are accurate.

@lintool

lintool commented Mar 17, 2020

Well, I mean, the export is an accurate snapshot of the index?

Actually, the doclengths in the postings are the lossy approximations...

@cmacdonald

Sorry, let me rephrase: are the doc lengths in the DocRecord part of the CIFF file lossy approximations or accurate?

@lintool

lintool commented Mar 17, 2020

The doclengths recorded in the DocRecord messages are approximate/lossy.

@cmacdonald

Ok, I got it; so the doclengths in the DocRecord are approximate, but the # of total tokens is exact.

There is a question of explicability: are we trying to produce an index of record, or just trying to reproduce Lucene/Anserini's index in our own systems? Only in the latter case does it make sense to keep approximations in the CIFF.

The explanations in the CIFF standard shouldn't refer to approximate values by default; it's an implementation choice of the Lucene/Anserini CIFF exporter to expose approximate values. I.e., this belongs in a README for the generated CIFF files.

@lintool

lintool commented Mar 17, 2020

Agreed on both counts.
