Create very simple export for testing purposes #12

Open
lintool opened this issue Mar 9, 2020 · 5 comments
Comments

@lintool
Member

lintool commented Mar 9, 2020

@JMMackenzie and @chriskamphuis have requested a sample export for testing purposes.

I propose exporting the index from this Anserini test case: https://github.com/castorini/anserini/blob/master/src/test/java/io/anserini/integration/TrecEndToEndTest.java

which indexes this 3-document toy collection: https://github.com/castorini/anserini/tree/master/src/test/resources/sample_docs/trec/collection2

sg?

@chriskamphuis
Member

sounds good

@JMMackenzie
Member

Perfect!

@lintool
Member Author

lintool commented Mar 9, 2020

toy-complete-20200309.ciff.gz

Reading header...
=== Header === 
version: 1
num_postings_lists: 9
num_doc_records: 3
total_postings_lists: 9
total_docs: 3
total_terms_in_collection: 16
average_doclength: 5.333333
description: Export of toy 3-document collection from Anserini's io.anserini.integration.TrecEndToEndTest test case

Expecting 9 postings lists and 3 doc records in this export.
term: '01', df=1, cf=1 (0, 1)
term: '03', df=1, cf=1 (0, 1)
term: '30', df=1, cf=1 (0, 1)
term: 'content', df=1, cf=1 (0, 1)
term: 'enough', df=1, cf=1 (2, 1)
term: 'head', df=3, cf=3 (0, 1) (1, 1) (1, 1)
term: 'simpl', df=2, cf=2 (1, 1) (1, 1)
term: 'text', df=3, cf=5 (0, 1) (1, 1) (1, 3)
term: 'veri', df=1, cf=1 (1, 1)
0	WSJ_1	6
1	TREC_DOC_1	4
2	DOC222	6

@lintool
Member Author

lintool commented Mar 10, 2020

TODO: encode above as a test case.

@cmacdonald
Member

It might be nice to have another file that demonstrates the "query terms only" case, i.e. num_postings_lists < total_postings_lists, along with the other relevant statistics.
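
As a rough illustration of the distinction, the sketch below compares the header counts to decide whether an export is complete or partial. The `HeaderStats` class and `is_partial_export` function are hypothetical names for this example only, and the values in `query_only` are invented to show the shape of such a header; they do not come from any real export.

```python
from dataclasses import dataclass


@dataclass
class HeaderStats:
    """Subset of the header fields shown in the dump earlier in this thread."""
    num_postings_lists: int    # postings lists present in this export
    total_postings_lists: int  # postings lists in the full index
    num_doc_records: int       # doc records present in this export
    total_docs: int            # documents in the full collection


def is_partial_export(header):
    """True when the export holds fewer postings lists than the full index."""
    return header.num_postings_lists < header.total_postings_lists


# Header values from the toy export posted above: a complete export.
toy = HeaderStats(num_postings_lists=9, total_postings_lists=9,
                  num_doc_records=3, total_docs=3)

# Hypothetical "query terms only" header; the count of 2 is made up.
query_only = HeaderStats(num_postings_lists=2, total_postings_lists=9,
                         num_doc_records=3, total_docs=3)

assert not is_partial_export(toy)
assert is_partial_export(query_only)
```

A test file for the partial case would presumably keep the collection-wide statistics unchanged while listing only the selected terms.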
