[Bug] HttpError : Invalid bucket name: 'wikipedia-simple-text-embedding-ada-002-100K', 400 #35

Open
2 tasks done
David-GERARD opened this issue Nov 21, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@David-GERARD

David-GERARD commented Nov 21, 2023

Is this a new bug?

  • I believe this is a new bug
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

Hi,

I used code from one of the example Colab notebooks on RAG with LangChain to build a lab on vector databases for students.

A minority of the students encountered the following error when importing the wikipedia-simple-text-embedding-ada-002-100K dataset from pinecone_datasets:
[three screenshots of the error; images not preserved]

Expected Behavior

This cell is supposed to run and import the dataset (it works on my laptop and for most of the students).

Steps To Reproduce

In Python 3.11, with the package versions listed under Environment below, run pinecone_datasets.load_dataset('wikipedia-simple-text-embedding-ada-002-100K ')

Relevant log output

No response

Environment

- **OS**: multiple (Windows and MacOS)
- **Language version**: python 3.11
- **Pinecone client version**: pinecone_datasets==0.6.2

Additional Context

None of our troubleshooting attempts worked, and we have not identified the common denominator that leads to this error. When using the list_datasets() method, wikipedia-simple-text-embedding-ada-002-100K appears in the list, so we were thinking it might be a server-side error.

@David-GERARD David-GERARD added the bug Something isn't working label Nov 21, 2023
@martinohanlon

I have experienced the same issue.

This relates to https://community.pinecone.io/t/pinecone-datasets-httperror-invalid-bucket-name-wikipedia-simple-text-embedding-ada-002-100k-400/3715/3 .

The root cause is that the code uses os.path.join to build a gs:// file path, and on Windows that inserts a backslash (\) as the separator, e.g.

gs://catalog_base_path\dataset_id
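The separator behaviour described above can be reproduced on any operating system, since the standard library exposes the Windows and POSIX flavours of os.path as ntpath and posixpath. This is only a minimal sketch: the bucket name is a made-up placeholder, not the real catalog path.

```python
import ntpath      # the implementation behind os.path on Windows
import posixpath   # the implementation behind os.path on Linux/macOS

base = "gs://example-bucket"  # hypothetical bucket, for illustration only
dataset_id = "wikipedia-simple-text-embedding-ada-002-100K"

# What os.path.join produces on Windows: a backslash ends up inside the URL.
windows_style = ntpath.join(base, dataset_id)

# What os.path.join produces on Linux/macOS: a forward slash, as gs:// expects.
posix_style = posixpath.join(base, dataset_id)

print(windows_style)
print(posix_style)
```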

The "dirty" fix is to modify this line of code https://github.com/pinecone-io/pinecone-datasets/blob/main/pinecone_datasets/dataset.py#L95

To

dataset_path = f"{catalog_base_path}/{dataset_id}"

But that won't work when catalog_base_path is a local path.
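One possible way to cover both the remote and the local case is to pick the separator based on whether the base path looks like a URL. The helper below is a hypothetical sketch, not code from pinecone_datasets; the name join_dataset_path and the "://" check are assumptions made for illustration.

```python
import os
import posixpath


def join_dataset_path(catalog_base_path: str, dataset_id: str) -> str:
    """Hypothetical helper (not part of pinecone_datasets)."""
    if "://" in catalog_base_path:
        # Remote URL such as gs:// or s3://: always join with forward slashes.
        return posixpath.join(catalog_base_path, dataset_id)
    # Local directory: keep the platform's native separator.
    return os.path.join(catalog_base_path, dataset_id)


remote = join_dataset_path("gs://example-bucket", "my-dataset")
local = join_dataset_path("local_catalog", "my-dataset")
print(remote)
print(local)
```

With this approach the remote path is the same on every platform, while a local catalog still gets whatever os.path.join would normally return.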

@David-GERARD
Author

Thanks @martinohanlon !

@martinohanlon

@David-GERARD I don't think the issue should be closed. It is a bug which should be fixed, imo.

@David-GERARD David-GERARD reopened this Dec 1, 2023
Daethyra added a commit to Daethyra/Build-RAGAI that referenced this issue Dec 21, 2023
- src/llm_utilikit/LangChain/notebooks/langchain-embeddings-retrieval-agent.ipynb
I found a dirty fix, but don't know how to use it and am currently too lazy to find out.
- pinecone-io/pinecone-datasets#35
@captainkapnap

@martinohanlon your solution worked; however, another error pops up afterwards.

C:\Users\xxx\AppData\Roaming\Python\Python311\site-packages\pinecone_datasets\dataset.py:280: UserWarning: WARNING: No data found at: gs://pinecone-datasets-dev/youtube-transcripts-text-embedding-ada-002/documents/*.parquet. Returning empty DF
warnings.warn(

Code in local Jupyter Notebook (Win10):

from pinecone_datasets import load_dataset, list_datasets
list_datasets()
dataset = load_dataset('youtube-transcripts-text-embedding-ada-002')
dataset.head()

^--- modified from: https://docs.pinecone.io/docs/using-public-datasets

The exact same code worked in a Google Colab notebook (@David-GERARD fyi)

@pdebuyer

pdebuyer commented Apr 8, 2024

Hey. The dirtiest solution is to patch os.path.join at the beginning of dataset.py
os.path.join = lambda *s: "/".join(s)
This should fix your issue @captainkapnap
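A slightly more contained variant of the same idea is to swap os.path.join only temporarily, so the rest of the process keeps the normal behaviour. This is a sketch under assumptions: forward_slash_join is a hypothetical name, posixpath.join stands in for the lambda above (it behaves slightly differently when a later component is absolute), and it only helps if the library looks up os.path.join at call time.

```python
import contextlib
import os
import posixpath


@contextlib.contextmanager
def forward_slash_join():
    """Hypothetical helper: temporarily make os.path.join use forward slashes."""
    original_join = os.path.join
    os.path.join = posixpath.join
    try:
        yield
    finally:
        # Always restore the platform's own join, even if the body raises.
        os.path.join = original_join


with forward_slash_join():
    # Placeholder bucket name; a real call to load the dataset would go here.
    remote_path = os.path.join("gs://example-bucket", "my-dataset")

print(remote_path)
```

Presumably the dataset loading call would be placed inside the with block. Note that local paths built inside the block also get forward slashes on Windows.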

@reddgr

reddgr commented Feb 15, 2025

I implemented a "dirty fix" inspired by @martinohanlon's comment. It essentially required replacing each use of "os.path.join" in dataset.py with an if-else block that, when the system's platform is Windows, constructs the path with an f-string and forward slashes. For example:

        # save documents
        # On Windows, build the paths with forward slashes so gs:// URLs stay valid.
        if platform.system() == "Windows":
            documents_path = f"{dataset_path}/documents"
            print(f"documents_path: {documents_path}")
        else:
            documents_path = os.path.join(dataset_path, "documents")
        # Same platform check for the parquet file inside the documents folder.
        if platform.system() == "Windows":
            parquet_path = f"{documents_path}/part-0.parquet"
            print(f"parquet_path: {parquet_path}")
        else:
            parquet_path = os.path.join(documents_path, "part-0.parquet")

Here's my fork in case anyone trying to download a Pinecone dataset on Windows finds it useful:
https://github.com/reddgr/pinecone-datasets
