Add image support to LLMTestCase #1210

Open
ethanelasky opened this issue Dec 5, 2024 · 11 comments

@ethanelasky

❗BEFORE YOU BEGIN❗
Are you on discord? 🤗 We'd love to have you asking questions on discord instead: https://discord.com/invite/a3K9c8GRGt
Yes

Is your feature request related to a problem? Please describe.
The current evaluation flow doesn't allow images to be inserted into test cases for VLMs (vision-language models).

Describe the solution you'd like
Permitting LLMTestCase to accept images as part of context.

Describe alternatives you've considered
Not adding images and only using text (doesn't work well for documents that require multimodal/VLM generations).

@penguine-ip
Contributor

Hey @ethanelasky does our MLLMTestCase solve your problem?

https://docs.confident-ai.com/docs/evaluation-test-cases#mllm-test-case
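
For reference, a minimal sketch of what the linked MLLMTestCase interface looks like; the field names follow the docs page above, the paths and strings are placeholders, and exact parameter names should be treated as an assumption if you're on a different deepeval version:

from deepeval.test_case import MLLMTestCase, MLLMImage

# A test case whose input and actual output interleave text and images.
mllm_test_case = MLLMTestCase(
    input=[
        "Summarize the figure on page 3.",
        MLLMImage(url="./figures/page3.png", local=True),  # placeholder path
    ],
    actual_output=[
        "The figure shows quarterly revenue growing roughly 12% per quarter.",
    ],
)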

@ethanelasky
Author

Thanks -- some suggestions:

  1. You should add compatibility for base64 strings, which all providers accept in place of actual image files (jpg, etc.), and which is the format images are often stored in on the backend (to reduce query latency). A minimal encoding sketch follows this list.
  2. You might want to add a mention of MLLMTestCase to the Quick Summary section of https://docs.confident-ai.com/docs/evaluation-test-cases to alert people to its presence.
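
A minimal sketch of the base64 workflow suggested in point 1, using only the standard library and an OpenAI-style data URL; the file path and payload shape are illustrative assumptions, not an existing deepeval API:

import base64

# Encode a local image (placeholder path) as a base64 string, the form in
# which images are often already stored in a backend.
with open("document_page.jpg", "rb") as f:
    image_base64 = base64.b64encode(f.read()).decode("utf-8")

# Most providers accept the encoded string inline via a data URL:
image_payload = {
    "type": "image_url",
    "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"},
}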

@ethanelasky
Author

@penguine-ip

@ethanelasky
Author

The docs also state that MLLMTestCase can only be used with multimodal metrics; those metrics are pretty sparse and currently not oriented towards text generation, so it might be helpful to extend the test cases and metrics to cater to that need.

@penguine-ip
Contributor

Hey @ethanelasky, thanks for the feedback. What do you mean by MLLM test cases for text generation? Do you mean there's no support for pure text evaluation?

The thing with MLLMTestCase is that it isn't used a lot as of now, apart from the base64 strings. Do you have any metrics/use cases in mind for MLLM evaluation? We have two MLLM metrics at the moment: https://docs.confident-ai.com/docs/metrics-text-to-image

@ethanelasky
Author

ethanelasky commented Dec 9, 2024

Hey Jeffrey, happy to help!

Sure, the documentation has a warning stating:
[screenshot of a documentation warning]
which seems to exclude using text-based metrics for the text+image -> text modality (even though the generation is purely text). Separately, it seems we can't use traditional retrieval metrics for text+image retrieval.

There are papers like [this](https://arxiv.org/pdf/2411.16365) which cover this modality in more depth and generate evaluations for it, and embedding providers like Voyage and Cohere have recently come out with high-performing multimodal text+image embedding models to support this type of retrieval.

@joslack

joslack commented Dec 12, 2024

I'm looking for a similar feature set. I'm working with a corpus of PDFs that contain a good amount of figures I wish to run evaluations over. Currently I have to use a VLM to extract text from / summarize the figures before giving them to a test case, but I'd like the test case to be able to incorporate the images directly. Adding image support to synthetic generation, as well as the standard test case suite, seems like a logical and natural extension of what you guys already have.
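
For concreteness, a minimal sketch of the workaround described above, where each retrieved figure is summarized by a VLM so that only text reaches the test case; the summarization helper, paths, and strings are illustrative assumptions rather than deepeval APIs:

from deepeval.test_case import LLMTestCase
from deepeval.metrics import ContextualRecallMetric

def summarize_figure(image_path: str) -> str:
    # Hypothetical helper: send the figure to a VLM and return its text summary.
    # Stand-in return value so the sketch runs end to end.
    return "The org chart shows the regional HR director reporting to the VP of People."

# Every image in the retrieval context is replaced by a VLM-generated summary,
# so the metric only ever sees text that is one step removed from the real context.
retrieval_context = [
    summarize_figure("figures/org_chart.png"),  # placeholder path
    "Text chunk retrieved from the HR handbook...",
]

test_case = LLMTestCase(
    input="Who approves relocation expenses?",
    actual_output="Relocation expenses are approved by the regional HR director.",
    expected_output="The regional HR director approves relocation expenses.",
    retrieval_context=retrieval_context,
)

ContextualRecallMetric().measure(test_case)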

@penguine-ip
Contributor

penguine-ip commented Dec 17, 2024

@ethanelasky and @joslack, @kritinv very kindly pushed out some new metrics for this - can you see if it is useful as a first step? #1230

@joslack

joslack commented Dec 18, 2024

Love what you're getting at here; I think the ImageCoherenceMetric is definitely valuable, since I've been evaluating image text extraction mostly based on vibes up to now. However, I think I may have been unclear in my initial description.

Suppose I have a multimodal RAG application over a corpus of HR and process documents. These documents often contain figures, images, or even images of other documents. As it stands, my search and retrieval process delivers some combination of both images and text to the context of my LLM to generate an answer. I could express this in langchain like:

# Assumes `image_base64` (a base64-encoded retrieved image) and
# `text_documents` (a list of retrieved text chunks) are already defined.
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage

chat = ChatAnthropic(model='claude-3-haiku-20240307')

messages = [
    HumanMessage(
        content=[
            {"type": "text", "text": "Analyze the following image and documents:"},

            # The retrieved image, passed inline as a base64 data URL
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}
            },

            # The retrieved text chunks
            *[{"type": "text", "text": f"Document {i+1}: {doc}"}
              for i, doc in enumerate(text_documents)]
        ]
    )
]

response = chat.invoke(messages)

I am interested in many of the metrics you provide, as well as the scaffolding around them to execute lots of them quickly, but I do not think the metrics provided currently support my use case very well.

For instance, I wish to directly evaluate the ContextualRecallMetric for a given expected_output and retrieval_context. To execute one such test case for my pipeline, I would first need to summarize/extract text from the images I provided as retrieval_context before I could compare them to the expected_output using the LLMTestCase interface. However, I would much rather run the evaluation on the exact data I used to generate an output rather than a degree removed from it. Extending the MLLM test cases to metrics like contextual recall is my basic ask; a sketch of what that might look like is below.
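
For illustration, a rough sketch of the requested interface; the metric class and the multimodal retrieval_context field below are hypothetical, describing the feature being asked for rather than an existing deepeval API:

from deepeval.test_case import MLLMTestCase, MLLMImage
# Hypothetical: a ContextualRecallMetric whose judge is shown images directly.
from deepeval.metrics import MultimodalContextualRecallMetric  # assumed name, not an existing class

test_case = MLLMTestCase(
    input=["Who approves relocation expenses?"],
    actual_output=["Relocation expenses are approved by the regional HR director."],
    expected_output=["The regional HR director approves relocation expenses."],
    # The exact retrieved items, images included, with no intermediate summarization.
    retrieval_context=[
        MLLMImage(url="figures/org_chart.png", local=True),  # placeholder path
        "Text chunk retrieved from the HR handbook...",
    ],
)

MultimodalContextualRecallMetric().measure(test_case)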

@kritinv
Collaborator

kritinv commented Dec 18, 2024

@joslack Hey, I understand your needs better now, and I agree that incorporating Multimodal RAG metrics seems like the next logical step for DeepEval. When you say "a degree removed," are you referring to adjustments needed in the DeepEval interface and pipeline, or are you focusing on the evaluation mechanism itself (e.g., avoiding image description as part of the retrieval context)?

The reason I’m asking is that I think the evaluation mechanism should align closely with what you’re currently doing as a workaround. This approach follows the LLM-as-a-judge pattern, rather than relying on something like image embeddings for evaluation (which would make sense for retrieval but not for assessing the retrieval process itself).

I’d be happy to help build this out once we get a clearer idea of your specific needs and expectations.

@joslack

joslack commented Dec 18, 2024

I'm referring to the evaluation mechanism itself. For instance, I would prefer the "verdicts" in the ContextualRecallMetric to be generated with direct reference to an image rather than from a textual description of that image. Take the following complex and noisy image:

[embedded example image: a complex, noisy document figure]

I would not expect my LLM to deliver a textual representation of this image perfect and exhaustive enough that I would trust an LLM-as-a-judge, working only from that text, on arbitrary questions about it. I would expect to lose information in the summarization and extraction process that would be critical to generating coherent verdicts.

Does this clarify what I mean by a "degree removed"?
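
To make the distinction concrete, a minimal sketch of an image-grounded verdict in the LLM-as-a-judge style, where the judge is shown the retrieved image itself rather than a textual summary of it; the model choice, prompt wording, and file path are illustrative assumptions:

import base64
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

judge = ChatOpenAI(model="gpt-4o")  # any vision-capable judge model would do

with open("figures/noisy_document.png", "rb") as f:  # placeholder path
    image_base64 = base64.b64encode(f.read()).decode("utf-8")

expected_output = "The regional HR director approves relocation expenses."

# The verdict is generated with direct reference to the image itself,
# not to a lossy textual description of it.
verdict = judge.invoke([
    HumanMessage(content=[
        {"type": "text", "text": (
            "Can the following expected output be attributed to the retrieval "
            f"context shown in the image? Answer yes/no with a reason.\n\n{expected_output}"
        )},
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{image_base64}"}},
    ])
])
print(verdict.content)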
