total_terms_in_collection != sum of doclengths in robust04 queries only #21

cmacdonald · 2020-03-17T15:03:57Z

For Robust04 queries only, the sum of the doclens is 167686911, while total_terms_in_collection=174540872 in the ciff file. Why is it more? This affects the avgdoclength, and hence the BM25 scores.
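To make the effect on scoring concrete, here is a minimal sketch of how the two totals lead to different average document lengths and therefore different BM25 term weights. The two totals are the ones reported above; the document count, the k1 and b values, and the textbook form of the BM25 term-frequency component are assumptions chosen for illustration, not values taken from the index.

```python
# Totals reported in this issue.
LOSSY_TOTAL = 167_686_911   # sum of the doclengths
EXACT_TOTAL = 174_540_872   # total_terms_in_collection

# Hypothetical document count, used only to turn totals into averages.
NUM_DOCS = 500_000


def bm25_tf(tf, doclen, avgdl, k1=0.9, b=0.4):
    """Textbook BM25 term-frequency component (IDF omitted).

    k1 and b are illustrative defaults, not read from any index.
    """
    norm = 1.0 - b + b * doclen / avgdl
    return tf * (k1 + 1.0) / (tf + k1 * norm)


avgdl_lossy = LOSSY_TOTAL / NUM_DOCS
avgdl_exact = EXACT_TOTAL / NUM_DOCS
ratio = EXACT_TOTAL / LOSSY_TOTAL  # independent of NUM_DOCS

# Same term frequency and document length, two different averages.
score_lossy = bm25_tf(tf=3, doclen=300, avgdl=avgdl_lossy)
score_exact = bm25_tf(tf=3, doclen=300, avgdl=avgdl_exact)

print(f"exact/lossy total ratio: {ratio:.4f}")
print(f"tf weight with lossy avgdl: {score_lossy:.6f}")
print(f"tf weight with exact avgdl: {score_exact:.6f}")
```

The ratio of the two totals is about 1.04 regardless of the document count, so any system that derives avgdoclength from one total rather than the other will see its length normalization shift by roughly 4%.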

lintool · 2020-03-17T15:08:35Z

I've confirmed this. This could be due to Lucene's doclength approximation? Need to dig deeper into this though.

lintool · 2020-03-17T15:17:55Z

I wrote a simple program to probe into this, and this is indeed the case:

Total number of terms in collection (sum of doclengths):
Lossy: 167686911
Exact: 174540872

The IndexReader reports the exact value, but the sum of doclengths is based on Lucene lossy values.

So, not a bug, just requires better documentation. I will add documentation as appropriate.
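As an illustration of why a sum of lossy lengths can fall below the exact total, here is a minimal sketch of a quantizer that keeps only the most significant bits of each length. This is a simplified stand-in written for this thread, not Lucene's actual encoding; the function name `quantize`, the `keep_bits` parameter, and the sample lengths are all hypothetical.

```python
def quantize(length, keep_bits=4):
    """Keep the top `keep_bits` significant bits of `length`, zero the rest.

    Simplified illustration of a lossy length encoding; it always rounds
    down, so the quantized value never exceeds the original.
    """
    if length < 0:
        raise ValueError("length must be non-negative")
    shift = max(length.bit_length() - keep_bits, 0)
    return (length >> shift) << shift


# Hypothetical document lengths.
doclens = [7, 100, 345, 1000, 5021]
lossy = [quantize(n) for n in doclens]

exact_sum = sum(doclens)
lossy_sum = sum(lossy)

print("exact:", doclens, "sum =", exact_sum)
print("lossy:", lossy, "sum =", lossy_sum)
```

Because every value is truncated toward zero, the errors do not cancel out: they accumulate in one direction, which matches the observation above that the lossy sum is smaller than the exact one.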

cmacdonald · 2020-03-17T15:19:44Z

If we are asking people to use a total number of tokens, shouldn't it be accurate? The doclengths in the posting lists are accurate.

lintool · 2020-03-17T15:41:58Z

Well, I mean, the export is an accurate snapshot of the index?

Actually, the doclengths in the postings are the lossy approximates...

cmacdonald · 2020-03-17T15:43:18Z

Sorry, let me rephrase: are the doc lengths in the DocRecord part of the ciff file lossy approximates or accurate?

lintool · 2020-03-17T15:49:42Z

The doclengths recorded in the DocRecord messages are approximate/lossy.

cmacdonald · 2020-03-17T16:05:23Z

Ok, I got it; so the doclengths in the DocRecord are approximate, but the # of total tokens is exact.
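A consumer of such a file could check for this mismatch before deciding which total to use for the average document length. The following is a minimal sketch under the assumption that the per-document lengths and the header total have already been read into plain Python values; `check_totals` and its return fields are hypothetical names, and no CIFF reader API is used here.

```python
def check_totals(doclengths, total_terms_in_collection):
    """Compare the sum of per-document lengths with the reported total.

    Returns the summed lengths, the absolute gap, and the gap relative
    to the reported total (0.0 if the reported total is zero).
    """
    summed = sum(doclengths)
    gap = total_terms_in_collection - summed
    relative = gap / total_terms_in_collection if total_terms_in_collection else 0.0
    return {"sum_doclengths": summed, "gap": gap, "relative_gap": relative}


# Toy values for illustration only.
report = check_totals([96, 320, 960], total_terms_in_collection=1445)
print(report)

if report["gap"] != 0:
    print("doclengths and total disagree; pick one source for avgdoclength")
```

Whichever source is chosen, using it consistently for both the per-document lengths and the average keeps the length normalization self-consistent.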

There is a question of explicability: are we trying to produce an index of record, or just trying to reproduce Lucene/Anserini's index in our own systems? Only in the latter case does it make sense to keep approximations in the CIFF.

The explanations in the CIFF standard shouldn't refer to any approximate values de facto; it's an implementation choice of the Lucene/Anserini CIFF exporter to expose approximate values. That is, it belongs in a readme for the generated CIFF files.

lintool · 2020-03-17T16:09:58Z

Agreed on both counts.

lintool self-assigned this Mar 17, 2020

lintool mentioned this issue Mar 20, 2020

ExtractDocumentLengths: prints out sum of doclengths, both lossy and lossless castorini/anserini#1040

Merged

JMMackenzie mentioned this issue Apr 7, 2021

Documentation improvements #28

Merged
