Cannot read LoTTE docs #251

Open
ftvalentini opened this issue Nov 15, 2023 · 0 comments
Labels: bug (Something isn't working)

Describe the bug
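Iterating over the documents of a LoTTE dataset fails with a FileNotFoundError: after the download, the expected collection.tsv file is not present in the extracted directory.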

Affected dataset(s)
LoTTE

To Reproduce
Run in Python:

import ir_datasets

dataset = ir_datasets.load("lotte/recreation/test")
for doc in dataset.docs_iter():
    print(doc)
    break

Get the error:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
/home/user/misc/ir-datasets.ipynb Cell 3 line 4
      1 import ir_datasets
      3 dataset = ir_datasets.load("lotte/recreation/test")
----> 4 for doc in dataset.docs_iter():
      5     print(doc)
      6     break

File ~/miniconda3/envs/py311/lib/python3.11/site-packages/ir_datasets/util/__init__.py:147, in DocstoreSplitter.__next__(self)
    146 def __next__(self):
--> 147     return next(self.it)

File ~/miniconda3/envs/py311/lib/python3.11/site-packages/ir_datasets/formats/tsv.py:92, in TsvIter.__next__(self)
     91 def __next__(self):
---> 92     line = next(self.line_iter)
     93     cols = line.rstrip('\n').split('\t')
     94     num_cols = len(self.cls._fields)

File ~/miniconda3/envs/py311/lib/python3.11/site-packages/ir_datasets/formats/tsv.py:28, in FileLineIter.__next__(self)
     26         self.stream = io.TextIOWrapper(self.ctxt.enter_context(self.dlc[self.stream_idx].stream()))
     27     else:
---> 28         self.stream = io.TextIOWrapper(self.ctxt.enter_context(self.dlc.stream()))
     29 while self.pos < self.start:
     30     line = self.stream.readline()

File ~/miniconda3/envs/py311/lib/python3.11/contextlib.py:502, in _BaseExitStack.enter_context(self, cm)
    499 except AttributeError:
    500     raise TypeError(f"'{cls.__module__}.{cls.__qualname__}' object does "
    501                     f"not support the context manager protocol") from None
--> 502 result = _enter(cm)
    503 self._push_cm_exit(cm, _exit)
    504 return result

File ~/miniconda3/envs/py311/lib/python3.11/contextlib.py:137, in _GeneratorContextManager.__enter__(self)
    135 del self.args, self.kwds, self.func
    136 try:
--> 137     return next(self.gen)
    138 except StopIteration:
    139     raise RuntimeError("generator didn't yield") from None

File ~/miniconda3/envs/py311/lib/python3.11/site-packages/ir_datasets/util/fileio.py:148, in RelativePath.stream(self)
    146 @contextlib.contextmanager
    147 def stream(self):
--> 148     with open(self.path(), 'rb') as f:
    149         yield f

FileNotFoundError: [Errno 2] No such file or directory: '/home/user/.ir_datasets/lotte/lotte_extracted/lotte/recreation/test/collection.tsv'
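
To capture just the missing path without the full notebook traceback, the reproduction can be wrapped in a small helper. This is only a sketch: `first_item_or_missing_path` is a hypothetical name, not part of ir_datasets, and it relies solely on the standard `filename` attribute that Python sets on `FileNotFoundError`.

```python
from typing import Any, Callable, Iterable, Optional, Tuple


def first_item_or_missing_path(
    make_iter: Callable[[], Iterable[Any]],
) -> Tuple[Optional[Any], Optional[str]]:
    """Return (first_item, None) on success, or (None, missing_path) on FileNotFoundError.

    Hypothetical helper for illustration; not part of ir_datasets.
    An empty iterable yields (None, None).
    """
    try:
        # next() pulls only the first element, so nothing beyond it is read.
        return next(iter(make_iter()), None), None
    except FileNotFoundError as exc:
        # OSError subclasses carry the offending path in `filename`.
        return None, str(exc.filename)
```

With ir_datasets installed, this could be called as `first_item_or_missing_path(ir_datasets.load("lotte/recreation/test").docs_iter)`; in the situation reported here, the second element of the result should be the collection.tsv path shown in the error above.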

Expected behavior
I expected to see the first document in the collection, as I do with beir/msmarco/test:

import ir_datasets

dataset = ir_datasets.load("beir/msmarco/test")
for doc in dataset.docs_iter():
    print(doc)
    break

returns:

GenericDoc(doc_id='0', text='The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.')

Additional context
In the terminal, cd ~/.ir_datasets/lotte && ls -R . returns:

.:
lotte_extracted

./lotte_extracted:
lotte

./lotte_extracted/lotte:
lifestyle  recreation

./lotte_extracted/lotte/lifestyle:
test

./lotte_extracted/lotte/lifestyle/test:
collection.tsv.pklz4

./lotte_extracted/lotte/lifestyle/test/collection.tsv.pklz4:
bin  bin.meta

./lotte_extracted/lotte/recreation:
test

./lotte_extracted/lotte/recreation/test:
collection.tsv.pklz4

./lotte_extracted/lotte/recreation/test/collection.tsv.pklz4:
bin  bin.meta
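
The listing above shows a collection.tsv.pklz4 directory in each test folder but no collection.tsv, which is the file the traceback fails to open. A minimal sketch for checking this programmatically follows. It assumes the directory layout shown in this issue and the location from the error message; `find_missing_collections` is a hypothetical helper name, not an ir_datasets function.

```python
from pathlib import Path
from typing import List


def find_missing_collections(root: Path) -> List[Path]:
    """List collection.tsv paths that are absent next to a collection.tsv.pklz4 entry.

    Hypothetical helper for illustration; not part of ir_datasets.
    """
    missing = []
    # Assumes the layout from this issue: <root>/<domain>/<split>/collection.tsv.pklz4
    for cache in sorted(root.glob("*/*/collection.tsv.pklz4")):
        expected = cache.with_name("collection.tsv")
        if not expected.is_file():
            missing.append(expected)
    return missing


if __name__ == "__main__":
    # Location taken from the error message, with the home directory generalized.
    lotte_root = Path.home() / ".ir_datasets" / "lotte" / "lotte_extracted" / "lotte"
    for path in find_missing_collections(lotte_root):
        print(f"missing: {path}")
```

On the directory tree shown above, this would be expected to report both lifestyle/test/collection.tsv and recreation/test/collection.tsv as missing.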

I'm working with:

Python implementation: CPython
Python version       : 3.11.0
IPython version      : 8.14.0

ir_datasets: 0.5.5

Compiler    : GCC 11.3.0
OS          : Linux
Release     : 5.15.0-84-generic
Machine     : x86_64
Processor   : x86_64
CPU cores   : 16
Architecture: 64bit
ftvalentini added the bug label on Nov 15, 2023