Add image support to LLMTestCase #1210

Open
ethanelasky opened this issue Dec 5, 2024 · 11 comments

@ethanelasky

❗BEFORE YOU BEGIN❗
Are you on discord? 🤗 We'd love to have you asking questions on discord instead: https://discord.com/invite/a3K9c8GRGt
Yes

Is your feature request related to a problem? Please describe.
The current evaluation flow doesn't allow images to be inserted into test cases for VLMs (vision-language models).

Describe the solution you'd like
Permitting LLMTestCase to accept images as part of context.

Describe alternatives you've considered
Not adding images and only using text (doesn't work well for documents that require multimodal/VLM generations).

@penguine-ip
Contributor

Hey @ethanelasky does our MLLMTestCase solve your problem?

https://docs.confident-ai.com/docs/evaluation-test-cases#mllm-test-case
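
For reference, a minimal sketch of what the linked MLLMTestCase interface looks like; the field names follow the docs page above, the paths and strings are placeholders, and exact parameter names should be treated as an assumption if you're on a different deepeval version:

from deepeval.test_case import MLLMTestCase, MLLMImage

# A test case whose input and actual output interleave text and images.
mllm_test_case = MLLMTestCase(
    input=[
        "Summarize the figure on page 3.",
        MLLMImage(url="./figures/page3.png", local=True),  # placeholder path
    ],
    actual_output=[
        "The figure shows quarterly revenue growing roughly 12% per quarter.",
    ],
)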

@ethanelasky
Author

Thanks -- some suggestions:

  1. You should add compatibility for base64 strings, which all providers accept in place of actual image files (jpg, etc.), and which is the format images are often stored in on the backend (to reduce query latency). A minimal encoding sketch follows this list.
  2. You might want to add a mention of MLLMTestCase to the Quick Summary section of https://docs.confident-ai.com/docs/evaluation-test-cases to alert people to its presence.
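
A minimal sketch of the base64 workflow suggested in point 1, using only the standard library and an OpenAI-style data URL; the file path and payload shape are illustrative assumptions, not an existing deepeval API:

import base64

# Encode a local image (placeholder path) as a base64 string, the form in
# which images are often already stored in a backend.
with open("document_page.jpg", "rb") as f:
    image_base64 = base64.b64encode(f.read()).decode("utf-8")

# Most providers accept the encoded string inline via a data URL:
image_payload = {
    "type": "image_url",
    "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"},
}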

@ethanelasky
Author

@penguine-ip

@ethanelasky
Author

The docs also state that MLLMTestCase can only be used with multimodal metrics; those metrics are pretty sparse and currently not oriented towards text generation, so it might be helpful to extend the test cases and metrics to cater to that need.

@penguine-ip
Contributor

Hey @ethanelasky, thanks for the feedback. What do you mean by MLLM test cases for text generation? Do you mean there's no support for pure text evaluation?

The thing with MLLMTestCase is that it isn't used a lot as of now, apart from the base64 strings. Do you have any metrics/use cases in mind for MLLM evaluation? We have two MLLM metrics at the moment: https://docs.confident-ai.com/docs/metrics-text-to-image

@ethanelasky
Author

ethanelasky commented Dec 9, 2024

Hey Jeffrey, happy to help!

Sure, the documentation has a warning stating:
[screenshot of a documentation warning]
which seems to exclude using text-based metrics for the text+image -> text modality (even though the generation is purely text). Separately, it seems we can't use traditional retrieval metrics for text+image retrieval.

There are papers like [this](https://arxiv.org/pdf/2411.16365) which cover this modality in more depth and generate evaluations for it, and embedding providers like Voyage and Cohere have recently come out with high-performing multimodal text+image embedding models to support this type of retrieval.

@joslack

joslack commented Dec 12, 2024

I'm looking for a similar feature set. I'm working with a corpus of PDFs that contain a good amount of figures I wish to run evaluations over. Currently I have to use a VLM to extract text from / summarize the figures before giving them to a test case, but I'd like the test case to be able to incorporate the images directly. Adding image support to synthetic generation, as well as the standard test case suite, seems like a logical and natural extension of what you guys already have.
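
For concreteness, a minimal sketch of the workaround described above, where each retrieved figure is summarized by a VLM so that only text reaches the test case; the summarization helper, paths, and strings are illustrative assumptions rather than deepeval APIs:

from deepeval.test_case import LLMTestCase
from deepeval.metrics import ContextualRecallMetric

def summarize_figure(image_path: str) -> str:
    # Hypothetical helper: send the figure to a VLM and return its text summary.
    # Stand-in return value so the sketch runs end to end.
    return "The org chart shows the regional HR director reporting to the VP of People."

# Every image in the retrieval context is replaced by a VLM-generated summary,
# so the metric only ever sees text that is one step removed from the real context.
retrieval_context = [
    summarize_figure("figures/org_chart.png"),  # placeholder path
    "Text chunk retrieved from the HR handbook...",
]

test_case = LLMTestCase(
    input="Who approves relocation expenses?",
    actual_output="Relocation expenses are approved by the regional HR director.",
    expected_output="The regional HR director approves relocation expenses.",
    retrieval_context=retrieval_context,
)

ContextualRecallMetric().measure(test_case)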

@penguine-ip
Contributor

penguine-ip commented Dec 17, 2024

@ethanelasky and @joslack, @kritinv very kindly pushed out some new metrics for this - can you see if it is useful as a first step? #1230

@joslack

joslack commented Dec 18, 2024

Love what you're getting at here; I think the ImageCoherenceMetric is definitely valuable, since I've been evaluating image text extraction mostly based on vibes up to now. However, I think I may have been unclear in my initial description.

Suppose I have a multimodal RAG application over a corpus of HR and process documents. These documents often contain figures, images, or even images of other documents. As it stands, my search and retrieval process delivers some combination of both images and text to the context of my LLM to generate an answer. I could express this in langchain like:

# Assumes `image_base64` (a base64-encoded retrieved image) and
# `text_documents` (a list of retrieved text chunks) are already defined.
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage

chat = ChatAnthropic(model='claude-3-haiku-20240307')

messages = [
    HumanMessage(
        content=[
            {"type": "text", "text": "Analyze the following image and documents:"},

            # The retrieved image, passed inline as a base64 data URL
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}
            },

            # The retrieved text chunks
            *[{"type": "text", "text": f"Document {i+1}: {doc}"}
              for i, doc in enumerate(text_documents)]
        ]
    )
]

response = chat.invoke(messages)

I am interested in many of the metrics you provide, as well as the scaffolding around them to execute lots of them quickly, but I do not think the metrics provided currently support my use case very well.

For instance, I wish to directly evaluate the ContextualRecallMetric for a given expected_output and retrieval_context. To execute one such test case for my pipeline, I would first need to summarize/extract text from the images I provided as retrieval_context before I could compare them to the expected_output using the LLMTestCase interface. However, I would much rather run the evaluation on the exact data I used to generate an output rather than a degree removed from it. Extending the MLLM test cases to metrics like contextual recall is my basic ask; a sketch of what that might look like is below.
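
For illustration, a rough sketch of the requested interface; the metric class and the multimodal retrieval_context field below are hypothetical, describing the feature being asked for rather than an existing deepeval API:

from deepeval.test_case import MLLMTestCase, MLLMImage
# Hypothetical: a ContextualRecallMetric whose judge is shown images directly.
from deepeval.metrics import MultimodalContextualRecallMetric  # assumed name, not an existing class

test_case = MLLMTestCase(
    input=["Who approves relocation expenses?"],
    actual_output=["Relocation expenses are approved by the regional HR director."],
    expected_output=["The regional HR director approves relocation expenses."],
    # The exact retrieved items, images included, with no intermediate summarization.
    retrieval_context=[
        MLLMImage(url="figures/org_chart.png", local=True),  # placeholder path
        "Text chunk retrieved from the HR handbook...",
    ],
)

MultimodalContextualRecallMetric().measure(test_case)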

@kritinv
Collaborator

kritinv commented Dec 18, 2024

@joslack Hey, I understand your needs better now, and I agree that incorporating Multimodal RAG metrics seems like the next logical step for DeepEval. When you say "a degree removed," are you referring to adjustments needed in the DeepEval interface and pipeline, or are you focusing on the evaluation mechanism itself (e.g., avoiding image description as part of the retrieval context)?

The reason I’m asking is that I think the evaluation mechanism should align closely with what you’re currently doing as a workaround. This approach follows the LLM-as-a-judge pattern, rather than relying on something like image embeddings for evaluation (which would make sense for retrieval but not for assessing the retrieval process itself).

I’d be happy to help build this out once we get a clearer idea of your specific needs and expectations.

@joslack

joslack commented Dec 18, 2024

I'm referring to the evaluation mechanism itself. For instance, I would prefer the "verdicts" in the ContextualRecallMetric to be generated with direct reference to an image rather than from a textual description of that image. Take the following complex and noisy image:

[embedded example image: a complex, noisy document figure]

I would not expect my LLM to deliver a textual representation of this image perfect and exhaustive enough that I would trust an LLM-as-a-judge, working only from that text, on arbitrary questions about it. I would expect to lose information in the summarization and extraction process that would be critical to generating coherent verdicts.

Does this clarify what I mean by a "degree removed"?
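
To make the distinction concrete, a minimal sketch of an image-grounded verdict in the LLM-as-a-judge style, where the judge is shown the retrieved image itself rather than a textual summary of it; the model choice, prompt wording, and file path are illustrative assumptions:

import base64
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

judge = ChatOpenAI(model="gpt-4o")  # any vision-capable judge model would do

with open("figures/noisy_document.png", "rb") as f:  # placeholder path
    image_base64 = base64.b64encode(f.read()).decode("utf-8")

expected_output = "The regional HR director approves relocation expenses."

# The verdict is generated with direct reference to the image itself,
# not to a lossy textual description of it.
verdict = judge.invoke([
    HumanMessage(content=[
        {"type": "text", "text": (
            "Can the following expected output be attributed to the retrieval "
            f"context shown in the image? Answer yes/no with a reason.\n\n{expected_output}"
        )},
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{image_base64}"}},
    ])
])
print(verdict.content)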
