Documents appears to be too short (ie 100 tokens or less). Please provide longer documents. #2083

Open
@ananthanarayanan431

[ ] I have checked the documentation and related resources and couldn't resolve my bug.

Describe the bug
Calling TestsetGenerator.generate_with_langchain_docs raises a ValueError saying the documents are too short. Full traceback:
ValueError                                Traceback (most recent call last)
Cell In[38], line 19
     12 generator = TestsetGenerator.from_langchain(
     13     llm=generator_llm,
     14     embedding_model=generator_embeddings,
     15 )
     17 query_distribution = default_query_distribution(generator_llm)
---> 19 testset = generator.generate_with_langchain_docs(
     20     documents=chunks,
     21     testset_size=10,
     22     query_distribution=query_distribution,
     23 )

File ~/Desktop/project/.venv/lib/python3.12/site-packages/ragas/testset/synthesizers/generate.py:164, in TestsetGenerator.generate_with_langchain_docs(self, documents, testset_size, transforms, transforms_llm, transforms_embedding_model, query_distribution, run_config, callbacks, with_debugging_logs, raise_exceptions)
    159     raise ValueError(
    160         """An embedding client was not provided. Provide an embedding through the transforms_embedding_model parameter. Alternatively you can provide your own transforms through the transforms parameter."""
    161     )
    163 if not transforms:
--> 164     transforms = default_transforms(
    165         documents=list(documents),
    166         llm=transforms_llm or self.llm,
    167         embedding_model=transforms_embedding_model or self.embedding_model,
    168     )
    170 # convert the documents to Ragas nodes
...
    161         "Documents appears to be too short (ie 100 tokens or less). Please provide longer documents."
    162     )
    164 return transforms

ValueError: Documents appears to be too short (ie 100 tokens or less). Please provide longer documents.

Ragas version: 0.2.15
Python version: 3.12
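
The error message refers to documents of "100 tokens or less", so a quick first check is how many of the input documents fall at or under that size. The snippet below is a minimal diagnostic sketch, not Ragas code: the helper names `whitespace_token_count` and `short_document_share` are hypothetical, and counting whitespace-separated words is only a rough stand-in for whatever tokenizer the library actually uses. Swap in a real tokenizer through `count_tokens` for closer numbers.

```python
from typing import Callable, Iterable


def whitespace_token_count(text: str) -> int:
    """Rough stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def short_document_share(
    texts: Iterable[str],
    max_tokens: int = 100,  # threshold taken from the error message
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> float:
    """Return the fraction of texts that have at most `max_tokens` tokens."""
    lengths = [count_tokens(text) for text in texts]
    if not lengths:
        raise ValueError("no documents supplied")
    short = sum(1 for n in lengths if n <= max_tokens)
    return short / len(lengths)


if __name__ == "__main__":
    # Synthetic sample: two short texts (50 and 80 words), two longer ones.
    sample = ["word " * 50, "word " * 150, "word " * 80, "word " * 600]
    print(f"share of short documents: {short_document_share(sample):.2f}")
```

With LangChain documents, the texts would presumably be `[doc.page_content for doc in chunks]`. A share close to 1.0 would suggest the chunks were split too finely for the default transforms.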

Code to Reproduce

from ragas.llms.base import LangchainLLMWrapper
from ragas.embeddings.base import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

from ragas.testset import TestsetGenerator
from ragas.testset.synthesizers import default_query_distribution

generator_llm = LangchainLLMWrapper(langchain_llm=ChatOpenAI(model="gpt-4o-mini"))
generator_embeddings = LangchainEmbeddingsWrapper(embeddings=OpenAIEmbeddings(model="text-embedding-3-small"))

generator = TestsetGenerator.from_langchain(
    llm=generator_llm,
    embedding_model=generator_embeddings,
)

query_distribution = default_query_distribution(generator_llm)

# NOTE: `chunks` is not defined in this report; it is presumably a list of
# LangChain Document objects produced earlier in the notebook by a text splitter.
testset = generator.generate_with_langchain_docs(
    documents=chunks,
    testset_size=10,
    query_distribution=query_distribution,
)
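
If `chunks` came from a splitter with a small chunk size, one possible workaround is to merge adjacent chunks until each one clears the 100-token mark before passing them to the generator. The sketch below illustrates that idea on plain strings. The helper `merge_short_chunks` is hypothetical, the whitespace word count only approximates real token counts, and whether this is enough to avoid the error is an assumption, not something confirmed in this report.

```python
from typing import Callable, Iterable, List


def whitespace_token_count(text: str) -> int:
    """Rough stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def merge_short_chunks(
    texts: Iterable[str],
    min_tokens: int = 101,  # just above the "100 tokens or less" limit
    count_tokens: Callable[[str], int] = whitespace_token_count,
    separator: str = "\n\n",
) -> List[str]:
    """Greedily join consecutive texts until each group has >= min_tokens."""
    merged: List[str] = []
    buffer: List[str] = []
    buffer_tokens = 0
    for text in texts:
        buffer.append(text)
        buffer_tokens += count_tokens(text)
        if buffer_tokens >= min_tokens:
            merged.append(separator.join(buffer))
            buffer, buffer_tokens = [], 0
    if buffer:
        # Leftover tail is too short on its own: attach it to the previous
        # group, or keep it as the only chunk if nothing was emitted yet.
        tail = separator.join(buffer)
        if merged:
            merged[-1] = merged[-1] + separator + tail
        else:
            merged.append(tail)
    return merged


if __name__ == "__main__":
    small = ["word " * 40] * 6  # six chunks of 40 words each
    for chunk in merge_short_chunks(small):
        print(whitespace_token_count(chunk))
```

For LangChain documents, the merged strings could be wrapped back into `Document(page_content=...)` objects from `langchain_core.documents`. Note that this simple version discards per-chunk metadata. Increasing the splitter's chunk size at the source is likely the cleaner fix.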

Error trace
ValueError

Expected behavior
The test dataset is generated without raising an error.

Labels: bug (Something isn't working), module-testsetgen (Module testset generation)
