Add image support to LLMTestCase #1210
Comments
Hey @ethanelasky, does our MLLM test case cover this? https://docs.confident-ai.com/docs/evaluation-test-cases#mllm-test-case
Thanks -- some suggestions
The docs also assert that
Hey @ethanelasky thanks for the feedback. What do you mean by MLLM test cases for text generation? Do you mean there's no support for pure text evaluation? The thing with the MLLM test case is that it's actually not used a lot as of now, except for the base64 strings. Do you have any metrics/use cases in mind for MLLM evaluation? We have two MLLM metrics at the moment: https://docs.confident-ai.com/docs/metrics-text-to-image
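For reference, here is a minimal sketch of what the linked MLLM test case looks like as I read the docs; the `MLLMTestCase`/`MLLMImage` names and arguments come from that page and may differ in the current release:

```python
# Minimal sketch of the MLLM test case from the linked docs (names and
# arguments are my reading of the docs, not guaranteed to match the API).
from deepeval.test_case import MLLMTestCase, MLLMImage

test_case = MLLMTestCase(
    # input and actual_output are lists that interleave text and images
    input=["Change the color of these shoes to blue", MLLMImage(url="./shoes.png", local=True)],
    actual_output=[
        "The same shoes, recolored to blue",
        MLLMImage(url="https://example.com/edited-shoes.png", local=False),
    ],
)
```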
I'm looking for a similar feature set. I'm working with a corpus of PDFs that contain a good number of figures I'd like to run evaluations over. Currently I have to use a VLM to extract text from or summarize each figure before handing it to a test case, but I'd like the test case to be able to incorporate the image directly. Adding image support to synthetic generation as well as the standard test case suite seems like a logical and natural extension of what you already have.
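To make the current workaround concrete, here is a rough sketch of that pre-summarization step; `summarize_figure_with_vlm` is a hypothetical helper standing in for whatever VLM call is actually used:

```python
# Sketch of the workaround: flatten figures to text before building a
# plain-text LLMTestCase. summarize_figure_with_vlm is a hypothetical helper.
from deepeval.test_case import LLMTestCase

def summarize_figure_with_vlm(image_base64: str) -> str:
    # Hypothetical: call a vision-language model and return a text description.
    return f"[VLM summary of figure {image_base64[:12]}...]"

figure_base64s = ["<base64 figure 1>", "<base64 figure 2>"]  # extracted from the PDF
text_chunks = ["Section 2.1: Reporting lines ..."]           # retrieved text passages

figure_summaries = [summarize_figure_with_vlm(b64) for b64 in figure_base64s]

test_case = LLMTestCase(
    input="What does the onboarding flowchart say happens after step 3?",
    actual_output="After step 3, the new hire's manager schedules the IT setup.",
    # Images only enter the test case as text summaries, never directly.
    retrieval_context=text_chunks + figure_summaries,
)
```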
@ethanelasky and @joslack, @kritinv very kindly pushed out some new metrics for this - can you see if they're useful as a first step? #1230
Love what you're getting at here. Suppose I have a multimodal RAG application over a corpus of HR and process documents. These documents often contain figures, images, or even images of other documents. As it stands, my search and retrieval process delivers some combination of both images and text to the context of my LLM to generate an answer. I could express this in LangChain like:

```python
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage

# Produced by the retrieval step
image_base64 = "<base64-encoded figure from retrieval>"
text_documents = ["Doc chunk 1 ...", "Doc chunk 2 ..."]

chat = ChatAnthropic(model="claude-3-haiku-20240307")

messages = [
    HumanMessage(
        content=[
            {"type": "text", "text": "Analyze the following image and documents:"},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"},
            },
            *[
                {"type": "text", "text": f"Document {i+1}: {doc}"}
                for i, doc in enumerate(text_documents)
            ],
        ]
    )
]
response = chat.invoke(messages)
```

I am interested in many of the metrics you provide, as well as the scaffolding around them to execute lots of them quickly, but I do not think the metrics provided currently support my use case very well.
@joslack Hey, I understand your needs better now, and I agree that incorporating Multimodal RAG metrics seems like the next logical step for DeepEval.

When you say "a degree removed," are you referring to adjustments needed in the DeepEval interface and pipeline, or are you focusing on the evaluation mechanism itself (e.g., avoiding image description as part of the retrieval context)?

The reason I'm asking is that I think the evaluation mechanism should align closely with what you're currently doing as a workaround. This approach follows the LLM-as-a-judge pattern, rather than relying on something like image embeddings for evaluation (which would make sense for retrieval but not for assessing the retrieval process itself).

I'd be happy to help build this out once we get a clearer idea of your specific needs and expectations.
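To illustrate the LLM-as-a-judge pattern over a multimodal retrieval context, here is a sketch reusing the LangChain + Anthropic stack from the earlier snippet; the prompt and the 0-10 scoring scheme are assumptions for illustration, not how DeepEval would necessarily implement such a metric:

```python
# Illustrative sketch of LLM-as-a-judge over a multimodal retrieval context.
# The prompt and scoring scheme are assumptions, not DeepEval's implementation.
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage

judge = ChatAnthropic(model="claude-3-haiku-20240307")

def judge_contextual_relevancy(question: str, image_base64: str, text_documents: list[str]) -> str:
    content = [
        {
            "type": "text",
            "text": (
                "On a scale of 0-10, how relevant is the following retrieval "
                f"context (one image plus documents) to the question: {question!r}? "
                "Answer with a score and a short reason."
            ),
        },
        # The retrieved image is judged directly, not via a text description.
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}},
        *[{"type": "text", "text": f"Document {i+1}: {doc}"} for i, doc in enumerate(text_documents)],
    ]
    return judge.invoke([HumanMessage(content=content)]).content
```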
❗BEFORE YOU BEGIN❗
Are you on discord? 🤗 We'd love to have you asking questions on discord instead: https://discord.com/invite/a3K9c8GRGt
Yes
Is your feature request related to a problem? Please describe.
Current evaluation doesn't allow for inserting images when evaluating VLM-based applications.
Describe the solution you'd like
Permitting LLMTestCase to accept images as part of its context (a hypothetical sketch of what this could look like follows this template).
Describe alternatives you've considered
Not adding images and only using text (doesn't work well for documents that require multimodal/VLM generations).
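As a purely hypothetical sketch of the requested shape (none of this exists in DeepEval today; the `ImageContext` and `MultimodalLLMTestCase` names below are invented for illustration):

```python
# Purely hypothetical: an imagined test case that accepts images alongside
# text in its retrieval context. Not existing DeepEval API.
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class ImageContext:
    """Invented wrapper for an image passed directly into a test case."""
    base64_data: str
    mime_type: str = "image/jpeg"

@dataclass
class MultimodalLLMTestCase:
    """Invented extension of LLMTestCase whose context can mix text and images."""
    input: str
    actual_output: str
    retrieval_context: List[Union[str, ImageContext]] = field(default_factory=list)

test_case = MultimodalLLMTestCase(
    input="Summarize the org chart on page 4.",
    actual_output="The org chart shows three teams reporting to the COO.",
    retrieval_context=[
        "Section 2.1: Reporting lines ...",
        ImageContext(base64_data="<base64 figure extracted from the PDF>"),
    ],
)
```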