Answer generated by a 🤖

From your description, it appears that you're having an issue with the Pinecone vector store in your LangChain application. Specifically, when you upsert a CSV file to Pinecone and query the document, only a small number of documents (rows) are being taken into account. You've mentioned that only 4 rows from a 500+ row document are being processed, even though your Pinecone index has 1536 dimensions and uses the 'cosine' metric.

One possible explanation for this issue could be related to the chunk size limit set in the upsert code. To address this issue, I suggest increasing the chunk size to 100, as recommended by Pinecone. This could potentially allow more documents to be processed at a time. Here's how you can adjust the chunk size:

// Pinecone recommends a limit of 100 vectors per upsert request
const chunkSize = 100; // increase chunk size to 100
for (let i = 0; i < pineconeVectors.length; i += chunkSize) {
  const chunk = pineconeVectors.slice(i, i + chunkSize);
  await this.pineconeIndex.upsert({
    upsertRequest: {
      vectors: chunk,
      namespace: this.namespace,
    },
  });
}

Additionally, I recommend adding error handling to the upsert operation to catch any issues that might be occurring during the process. This can help identify if there are any problems that might be causing only a portion of your documents to be added to the Pinecone index. Here's how you can add error handling:

// Pinecone recommends a limit of 100 vectors per upsert request
const chunkSize = 100; // same 100-vector chunks as above
for (let i = 0; i < pineconeVectors.length; i += chunkSize) {
  const chunk = pineconeVectors.slice(i, i + chunkSize);
  try {
    await this.pineconeIndex.upsert({
      upsertRequest: {
        vectors: chunk,
        namespace: this.namespace,
      },
    });
  } catch (error) {
    console.error('Error during upsert operation:', error);
  }
}

Please note that this is just one potential solution based on the information provided. If this solution works for you and you believe it could be helpful to others, I encourage you to make a pull request to update the LangChain documentation. This way, other users who encounter the same issue can benefit from your experience.

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
Hoping someone knows how this can be resolved.
I use the CSVLoader to upsert a CSV file to Pinecone. I add metadata to each document.
To query the document, I define a vector store over the existing Pinecone index and then use the ConversationalRetrievalQAChain to query against the specific file (a rough sketch of this kind of setup is included below).
For some reason, the vector store only appears to take a small number of documents (rows) into account (e.g. 4 rows from a 500+ row document).
The Pinecone index has 1536 dimensions and uses the 'cosine' metric.
Has anyone been able to do this successfully, or can anyone point me in the right direction?
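For reference, here is a rough, illustrative sketch of this kind of setup. It is not the exact code from the post: the index name, namespace, file path, and model choices are placeholders, and it assumes the LangChain JS CSVLoader, PineconeStore, and ConversationalRetrievalQAChain APIs together with the v0 @pinecone-database/pinecone client used in the snippets above.

import { PineconeClient } from "@pinecone-database/pinecone";
import { CSVLoader } from "langchain/document_loaders/fs/csv";
import { OpenAIEmbeddings } from "langchain/embeddings/openai";
import { ChatOpenAI } from "langchain/chat_models/openai";
import { PineconeStore } from "langchain/vectorstores/pinecone";
import { ConversationalRetrievalQAChain } from "langchain/chains";

async function run() {
  // Connect to the existing Pinecone index (1536 dimensions, cosine metric).
  const client = new PineconeClient();
  await client.init({
    apiKey: process.env.PINECONE_API_KEY!,
    environment: process.env.PINECONE_ENVIRONMENT!,
  });
  const pineconeIndex = client.Index("my-index"); // placeholder index name

  // Load the CSV (one Document per row), attach metadata, and upsert.
  const loader = new CSVLoader("data/my-file.csv"); // placeholder path
  const docs = await loader.load();
  docs.forEach((doc) => {
    doc.metadata = { ...doc.metadata, source: "my-file.csv" };
  });
  await PineconeStore.fromDocuments(docs, new OpenAIEmbeddings(), {
    pineconeIndex,
    namespace: "my-file", // placeholder namespace for this file
  });

  // Define the vector store over the existing index and query it.
  const vectorStore = await PineconeStore.fromExistingIndex(new OpenAIEmbeddings(), {
    pineconeIndex,
    namespace: "my-file",
  });
  const chain = ConversationalRetrievalQAChain.fromLLM(
    new ChatOpenAI({ temperature: 0 }),
    vectorStore.asRetriever()
  );
  const res = await chain.call({
    question: "What do the rows say about X?",
    chat_history: [],
  });
  console.log(res.text);
}

run().catch(console.error);

The namespace passed to fromExistingIndex here is assumed to match the one used when the vectors were upserted, so the query runs against the vectors for that specific file.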