[Bug] Unable to load yfcc-10M-filter-euclidean dataset #45

Open
2 tasks done
yudhiesh opened this issue Apr 29, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@yudhiesh

yudhiesh commented Apr 29, 2024

Is this a new bug?

  • I believe this is a new bug
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

When I try to load the yfcc-10M-filter-euclidean dataset, I get the error "FileNotFoundError: Dataset does not exist. Please check the path or dataset_id".

Expected Behavior

The dataset should load, since it is listed by list_datasets().

Steps To Reproduce

from pinecone_datasets import list_datasets, load_dataset

datasets = list_datasets()
dataset_name = "yfcc-10M-filter-euclidean"
# The name is listed in the catalog, so this assertion passes...
assert dataset_name in datasets, "Dataset does not exist!"
# ...but loading it raises FileNotFoundError
dataset = load_dataset(dataset_name)
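
To check whether other listed datasets fail the same way, the steps above could be generalized roughly as follows. This is only a sketch: `find_unloadable` and `fake_loader` are hypothetical helpers, not part of pinecone_datasets, and the stand-in loader just imitates the error reported here so the example runs offline.

```python
from typing import Callable, Iterable


def find_unloadable(names: Iterable[str], loader: Callable[[str], object]) -> dict:
    """Return {name: error message} for every name whose load raises FileNotFoundError."""
    failures = {}
    for name in names:
        try:
            loader(name)
        except FileNotFoundError as exc:
            failures[name] = str(exc)
    return failures


def fake_loader(name: str) -> str:
    """Stand-in for load_dataset (illustration only): fails for the dataset in this report."""
    if name == "yfcc-10M-filter-euclidean":
        raise FileNotFoundError(
            "Dataset does not exist. Please check the path or dataset_id"
        )
    return name


# "hypothetical-dataset" is a made-up name used only to show a successful load.
failures = find_unloadable(
    ["yfcc-10M-filter-euclidean", "hypothetical-dataset"], fake_loader
)
print(failures)
```

With the real library, one would pass `list_datasets()` and `load_dataset` instead of the stand-ins; running that across the whole catalog may be slow.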

Relevant log output

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[6], line 1
----> 1 load_dataset('yfcc-10M-filter-euclidean')

File ~/vector_db_benchmark/venv/lib/python3.10/site-packages/pinecone_datasets/public.py:59, in load_dataset(dataset_id, **kwargs)
     57     raise FileNotFoundError(f"Dataset {dataset_id} not found in catalog")
     58 else:
---> 59     return Dataset.from_catalog(dataset_id, **kwargs)

File ~/vector_db_benchmark/venv/lib/python3.10/site-packages/pinecone_datasets/dataset.py:89, in Dataset.from_catalog(cls, dataset_id, catalog_base_path, **kwargs)
     83 catalog_base_path = (
     84     catalog_base_path
     85     if catalog_base_path
     86     else os.environ.get("DATASETS_CATALOG_BASEPATH", cfg.Storage.endpoint)
     87 )
     88 dataset_path = os.path.join(catalog_base_path, f"{dataset_id}")
---> 89 return cls(dataset_path=dataset_path, **kwargs)

File ~/vector_db_benchmark/venv/lib/python3.10/site-packages/pinecone_datasets/dataset.py:190, in Dataset.__init__(self, dataset_path, **kwargs)
    188     self._dataset_path = dataset_path
    189     if not self._fs.exists(self._dataset_path):
--> 190         raise FileNotFoundError(
    191             "Dataset does not exist. Please check the path or dataset_id"
    192         )
    193 else:
    194     self._fs = None

FileNotFoundError: Dataset does not exist. Please check the path or dataset_id
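
Reading the frames above, the path that fails the existence check appears to be built from a base path plus the dataset ID. The following sketch mirrors only the lines visible in the traceback; `resolve_dataset_path` is a hypothetical helper, and the default endpoint is a placeholder because the value of `cfg.Storage.endpoint` is not shown in the log.

```python
import os

# Placeholder only: the real default is cfg.Storage.endpoint, whose value
# does not appear in the traceback.
HYPOTHETICAL_DEFAULT_ENDPOINT = "gs://example-catalog"


def resolve_dataset_path(dataset_id, catalog_base_path=None,
                         default_endpoint=HYPOTHETICAL_DEFAULT_ENDPOINT):
    """Mirror the base-path fallback seen in Dataset.from_catalog (dataset.py:83-88)."""
    base = (
        catalog_base_path
        if catalog_base_path
        else os.environ.get("DATASETS_CATALOG_BASEPATH", default_endpoint)
    )
    # The traceback shows os.path.join being used to append the dataset ID.
    return os.path.join(base, dataset_id)


print(resolve_dataset_path("yfcc-10M-filter-euclidean"))
```

Printing the resolved path this way may help confirm which location the existence check is probing, and whether `DATASETS_CATALOG_BASEPATH` is set in the environment.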

Environment

- **OS**: macOS 14.4.1
- **Language version**: Python 3.10.10
- **Pinecone client version**: 0.7.0

Additional Context

Looking at the catalog metadata for this dataset:

from pinecone_datasets import list_datasets

datasets = list_datasets(as_df=True)
dataset_name = "yfcc-10M-filter-euclidean"
# Select the catalog row for this dataset and dump it as a dict
datasets.query('name == @dataset_name').to_dict()

The result shows that the catalog entry has no bucket set ('bucket' is None), which suggests the data is not in the bucket:

{'name': {27: 'yfcc-10M-filter-euclidean'},
 'created_at': {27: '2023-08-24 13:51:29.136759'},
 'documents': {27: 10000000},
 'queries': {27: 100000},
 'source': {27: 'big-ann-challenge 2023'},
 'license': {27: None},
 'bucket': {27: None},
 'task': {27: None},
 'dense_model': {27: {'name': 'yfcc', 'tokenizer': None, 'dimension': 192}},
 'sparse_model': {27: None},
 'description': {27: 'Dataset from the 2023 big ann challenge - filter track. Distance: Euclidean. see https://big-ann-benchmarks.com/neurips23.html'},
 'tags': {27: None},
 'args': {27: None}}
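
The same check could be applied to every catalog entry to find others with a missing bucket. A minimal sketch, assuming the catalog rows are available as plain dicts (for example via pandas' `to_dict("records")` on the DataFrame); `names_without_bucket` is a hypothetical helper, and whether a None bucket is the actual cause of the load failure is not confirmed here.

```python
def names_without_bucket(records):
    """Return the names of catalog records whose 'bucket' field is None or absent."""
    return [r["name"] for r in records if r.get("bucket") is None]


records = [
    # Fields taken from the metadata shown in this issue.
    {"name": "yfcc-10M-filter-euclidean", "bucket": None},
    # Made-up entry, included only to show a record that has a bucket.
    {"name": "hypothetical-dataset", "bucket": "gs://example-bucket/hypothetical-dataset"},
]

missing = names_without_bucket(records)
print(missing)
```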
yudhiesh added the bug label on Apr 29, 2024
@Zmasterx

Hello, I'm having the same issue. Has it been resolved?
