Possible Bug in calculate_score.py, empty responses or extractions results in non-empty normalized extraction due to `get_most_similar` #19

mattmazzola · 2024-02-14T23:55:07Z

I was debugging an issue where our model output empty responses for all questions, and noticed the accuracy score was still 22% when I expected 0%.

I debugged further and found that for multi_choice questions there is a code path that computes the Levenshtein distance without guarding against empty input, so it returns a valid choice regardless.
(Likely the choice with the fewest characters, since that has the minimum edit distance from an empty string, or the first choice if all are the same length.)

MathVista/evaluation/calculate_score.py

Lines 45 to 51 in 82f68d0

    
    if extraction in options:
        # convert option letter to text, e.g. "A" -> "text"
        ind = options.index(extraction)
        extraction = choices[ind]
    else:
        # select the most similar option
        extraction = get_most_similar(extraction, choices)

MathVista/evaluation/calculate_score.py

Lines 14 to 20 in 82f68d0

    
    def get_most_similar(prediction, choices):
        """
        Use the Levenshtein distance (or edit distance) to determine which of the choices is most similar to the given prediction
        """
        distances = [distance(prediction, choice) for choice in choices]
        ind = distances.index(min(distances))
        return choices[ind]
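
To illustrate the failure mode, here is a minimal sketch. It assumes `distance` behaves like a standard Levenshtein distance; the pure-Python `edit_distance` below is a stand-in written for this example, not the function the script imports, and the sample choices are made up.

```python
def edit_distance(a, b):
    """Plain dynamic-programming Levenshtein distance (stand-in for `distance`)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def get_most_similar(prediction, choices):
    # Same logic as the function quoted above, using the stand-in distance.
    distances = [edit_distance(prediction, choice) for choice in choices]
    ind = distances.index(min(distances))
    return choices[ind]


# Hypothetical choices, for illustration only.
choices = ["twelve", "9", "fifteen"]

# The distance from "" to any string is that string's length,
# so an empty prediction always maps to the shortest choice.
print([edit_distance("", c) for c in choices])  # [6, 1, 7]
print(get_most_similar("", choices))            # 9
```

If the shortest choice happens to be the correct answer, an empty response is scored as correct, which would explain a non-zero accuracy on all-empty output.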

I also noticed questionable exception handling when coercing the input value to a string.
It assigns an empty string and continues. I think it should exit early and return None.
Assigning an empty string could further contribute to the issue above for multiple-choice problems where the extraction is not a string.

MathVista/evaluation/calculate_score.py

Lines 30 to 36 in 82f68d0

    
    if isinstance(extraction, str):
        extraction = extraction.strip()
    else:
        try:
            extraction = str(extraction)
        except:
            extraction = ""
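
A minimal sketch of the early-return behaviour suggested above. The helper name `coerce_extraction` is hypothetical and not part of calculate_score.py, treating a `None` input as empty is an assumption, and callers would need to score a `None` result as incorrect.

```python
def coerce_extraction(extraction):
    """Return a stripped, non-empty string, or None if there is nothing to score.

    Hypothetical helper for illustration; not part of calculate_score.py.
    """
    # Assumption: a missing extraction (None) counts as empty.
    if extraction is None:
        return None
    if not isinstance(extraction, str):
        try:
            extraction = str(extraction)
        except Exception:
            # Coercion failed: bail out instead of substituting "".
            return None
    extraction = extraction.strip()
    # An empty string must not reach the similarity fallback.
    return extraction or None


print(coerce_extraction("  B "))  # B
print(coerce_extraction(""))      # None
print(coerce_extraction(3.5))     # 3.5
```

With something like this in place, the multi_choice branch could skip `get_most_similar` whenever the result is `None`.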

Video Demonstration

https://youtu.be/vj07WRvcLDw

mattmazzola added a commit to mattmazzola/MathVista that referenced this issue Feb 15, 2024

Return None when str coercion fails or when empty

37d5706

Fixes lupantech#19

mattmazzola linked a pull request Feb 15, 2024 that will close this issue

Return None when str coercion fails or when empty #20

Open
