Add .devcontainer, update GPT to use OpenAI >1.x, make Claude and Bard imports dynamic and optional, use HuggingFace datasets #22

mattmazzola · 2024-03-03T19:04:31Z

I worked closely with this repo a few weeks ago and would like to contribute some of my modifications back so that others can benefit.

Issues

The installation and setup of the repo were not explicitly specified. See "How to use generate_responses.py? (References non-existing utilities module)" #13 (comment)
The code in the repo was still set up to use locally downloaded data, but the data is now available on HuggingFace
1. The HuggingFace dataset has all splits, is easier to manage, and abstracts the data-management problem away from the developer
The gpt.py model was using an old version of the openai library
The Bard and Claude libraries were supposed to be optional but were not
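The HuggingFace-hosted data mentioned in the list above can be loaded with the `datasets` library. The following is a minimal sketch, not the code in this PR: the dataset name and split are taken from the sample log output further down, while the helper names `load_mathvista` and `index_by_pid` and the use of a `pid` field as the problem identifier are assumptions made for illustration.

```python
def load_mathvista(split="testmini", dataset_name="AI4Math/MathVista"):
    """Load one split of the dataset from the HuggingFace Hub.

    The import is done lazily so this module can be imported even when
    the `datasets` package is not installed.
    """
    from datasets import load_dataset

    return load_dataset(dataset_name, split=split)


def index_by_pid(rows):
    """Map each problem id to its record.

    Assumes every record carries a `pid` field (an assumption about the
    dataset schema, not something stated in this PR).
    """
    return {str(row["pid"]): row for row in rows}


if __name__ == "__main__":
    # Requires network access and the `datasets` package.
    problems = index_by_pid(load_mathvista())
    print(f"Loaded {len(problems)} problems")
```

Keeping the indexing step separate from the download makes it possible to exercise the rest of the pipeline with plain dictionaries, without touching the network.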

Solutions

Use .devcontainer to standardize the development environment and dependency installation
Change the evaluation files to all use the dataset from HuggingFace
Update the GPT file to use the newer OpenAI library, with environment variables for Azure OpenAI
Make the imports of claude, openai, and bard dynamic, so each is imported only if that model type is chosen
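The dynamic-import idea in the last item can be sketched with the standard library's `importlib`. This is an illustration under stated assumptions rather than the code in this PR: the registry `MODEL_MODULES`, the function `load_model_module`, and the package names mapped to each model type are hypothetical.

```python
import importlib
from types import ModuleType

# Hypothetical mapping from the chosen model type to the third-party
# package it depends on. The package names are assumptions.
MODEL_MODULES = {
    "gpt": "openai",
    "claude": "anthropic",
    "bard": "bardapi",
}


def load_model_module(model_type: str, registry=None) -> ModuleType:
    """Import the package for `model_type` only when it is requested."""
    registry = MODEL_MODULES if registry is None else registry
    if model_type not in registry:
        raise ValueError(
            f"Unknown model type {model_type!r}; expected one of {sorted(registry)}"
        )
    module_name = registry[model_type]
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Surface a clear message instead of failing at program start-up.
        raise ImportError(
            f"Model type {model_type!r} needs the optional package {module_name!r}"
        ) from exc
```

With this shape, a user who only runs the GPT model never needs the Claude or Bard packages installed, because those imports are attempted only when that model type is selected.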

Other Misc

Use proper logging with rich formatting
Separate metrics calculation from logging
Use pandas DataFrame to print metrics in nicer tables (see below)
Add the ability to limit all steps of evaluation (generate, extract, calculate) to a maximum number of problems
1. Allows easier testing of functionality on small subsets
Remove duplicate definitions of get_chat_response (Fixes: Redundant implementations of get_chat_response #16)
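The problem limit described above could look roughly like the following. This is a sketch only: the function name `limit_problems` and its signature are hypothetical, and the warning text simply mirrors the sample log output.

```python
import logging
from typing import Iterable, List, Optional

logger = logging.getLogger(__name__)


def limit_problems(
    problem_ids: Iterable[str], max_num_problems: Optional[int] = None
) -> List[str]:
    """Return at most `max_num_problems` ids, keeping the original order.

    `None` means no limit is applied.
    """
    ids = list(problem_ids)
    if max_num_problems is None or max_num_problems >= len(ids):
        return ids
    logger.warning("Limiting number of problems to %d.", max_num_problems)
    return ids[:max_num_problems]
```

Applying the same helper in the generate, extract, and calculate steps keeps all three working on the same subset when the input order is stable.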

Sample Output

Generate Responses

[18:09:52] INFO     [root] MathVista: Generating Responses - Start                                              
[18:09:52] INFO     [root] Loading dataset AI4Math/MathVista, split testmini...                                 
[18:10:01] INFO     [root] Creating new query...                                                                
[18:10:01] INFO     [root] Loading gpt-4-32k...                                                                 
[18:10:01] INFO     [root] Model loaded.                                                                        
[18:10:01] INFO     [root] Results already exist.                                                               
[18:10:01] INFO     [root] Reading _results/eval/mathvista/gpt4/debug/gpt4.json...                              
[18:10:01] WARNING  [root] Limiting number of problems to 20.                                                   
[18:10:01] INFO     [root] Number of test problems to run: 20                                                   
  0%|                                                                                    | 0/20 [00:00<?, ?it/s][18:10:01] DEBUG    [root] --------------------------------------------------------------                       
[18:10:01] DEBUG    [root] Generating response for problem: 1...                                            
[18:10:14] DEBUG    [root] Query:                                                                               
                    Question: When a spring does work on an object, we cannot find the work by simply           
                    multiplying the spring force by the object's displacement. The reason is that there is no   
                    one value for the force-it changes. However, we can split the displacement up into an       
                    infinite number of tiny parts and then approximate the force in each as being constant.     
                    Integration sums the work done in all those parts. Here we use the generic result of the    
                    integration.                                                                                
                                                                                                                
                    In Figure, a cumin canister of mass $m=0.40 \mathrm{~kg}$ slides across a horizontal        
                    frictionless counter with speed $v=0.50 \mathrm{~m} / \mathrm{s}$. It then runs into and    
                    compresses a spring of spring constant $k=750 \mathrm{~N} / \mathrm{m}$. When the canister  
                    is momentarily stopped by the spring, by what distance $d$ is the spring compressed?        
                    Hint: Please answer the question requiring a floating-point number with one decimal place   
                    and provide the final value, e.g., 1.2, 1.3, 1.4, at the end.                               
                    Solution:                                                                                   
[18:10:14] DEBUG    [root] Response:                                                                            
                    The spring does work on the canister, bringing it to rest. The work done by the spring is   
                    equal to the kinetic energy of the canister before it hits the spring. The work done by the 
                    spring is given by the equation $W = \frac{1}{2}kx^2$, where $x$ is the distance the spring 
                    is compressed. The kinetic energy of the canister is given by the equation $KE =            
                    \frac{1}{2}mv^2$. Setting these two equal to each other gives:                              
                                                                                                                
                    $\frac{1}{2}kx^2 = \frac{1}{2}mv^2$                                                         
                                                                                                                
                    Solving for $x$ gives:                                                                      
                                                                                                                
                    $x = \sqrt{\frac{mv^2}{k}}$                                                                 
                                                                                                                
                    Substituting the given values gives:                                                        
                                                                                                                
                    $x = \sqrt{\frac{(0.40 \mathrm{~kg})(0.50 \mathrm{~m/s})^2}{750 \mathrm{~N/m}}}$            
                                                                                                                
                    $x = 0.01 \mathrm{~m}$                                                                      
                                                                                                                
                    So, the spring is compressed by a distance of 0.01 m or 1.0 cm.                             
  5%|███▊                                                                        | 1/20 [00:13<04:08, 13.05s/it][18:10:14] DEBUG    [root] --------------------------------------------------------------                       
[18:10:14] DEBUG    [root] Generating response for problem: 2...      
...
[18:11:18] DEBUG    [root] Query:                                                                               
                    Question: Is the sum of smallest two bar is greater then the largest bar?                   
                    Choices:                                                                                    
                    (A) Yes                                                                                     
                    (B) No                                                                                      
                    Hint: Please answer the question and provide the correct option letter, e.g., A, B, C, D, at
                    the end.                                                                                    
                    Solution:                                                                                   
[18:11:18] DEBUG    [root] Response:                                                                            
                    The question does not provide enough information for a solution. It refers to "bars" but    
                    does not specify their sizes or quantities.                                                 
[18:11:18] INFO     [root] Saved results to _results/eval/mathvista/gpt4/debug/gpt4.json                        
100%|███████████████████████████████████████████████████████████████████████████| 20/20 [01:17<00:00,  3.89s/it]
[18:11:18] INFO     [root] MathVista: Generating Responses - Finish

Extract Answer

[18:16:09] INFO     [root] MathVista: Extract Answers - Start                                                   
[18:16:09] INFO     [root] Reading _results/eval/mathvista/gpt4/debug/gpt4.json...                              
[18:16:09] INFO     [root] Number of test problems to run: 20                                                   
 95%|███████████████████████████████████████████████████████████████████████▎   | 19/20 [00:35<00:02,  2.91s/it][18:16:46] INFO     [root] Saved results to _results/eval/mathvista/gpt4/debug/gpt4.json                        
100%|███████████████████████████████████████████████████████████████████████████| 20/20 [00:37<00:00,  1.86s/it]
[18:16:46] INFO     [root] MathVista: Extract Answers - Finish

Calculate Score

[18:21:17] INFO     [root] MathVista: Calculating Scores - Start                                                
[18:21:17] INFO     [root] Loading dataset AI4Math/MathVista, split testmini...                                 
[18:21:25] INFO     [root] Reading _results/eval/mathvista/gpt4/debug/gpt4.json...                              
[18:21:25] INFO     [root] Number of testing problems: 20                                                       
[18:21:25] INFO     [root] For each problem normalize extractions and get True False value                      
100%|████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 34735.44it/s]
[18:21:25] INFO     [root] Calculate the average accuracy                                                       
100%|███████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 353949.70it/s]
/workspaces/MathVista/evaluation/calculate_score.py:249: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  values += results_df[key][i]
[18:21:25] INFO     [root] Correct: 8/20 - Accuracy: 40.00%                                                     
                    ========================================                                                    
                                                                                                                
                    question_type                                                                               
                    ========================================                                                    
                                 Accuracy Correct/Total                                                         
                    multi_choice   61.54%        (8/13)                                                         
                    free_form       0.00%         (0/7)                                                         
                                                                                                                
                    answer_type                                                                                 
                    ========================================                                                    
                            Accuracy Correct/Total                                                              
                    text      61.54%        (8/13)                                                              
                    float      0.00%         (0/1)                                                              
                    integer    0.00%         (0/6)                                                              
                                                                                                                
                    language                                                                                    
                    ========================================                                                    
                            Accuracy Correct/Total                                                              
                    chinese   66.67%         (2/3)                                                              
                    english   35.29%        (6/17)                                                              
                                                                                                                
                    source                                                                                      
                    ========================================                                                    
                                Accuracy Correct/Total                                                          
                    UniGeo       100.00%         (1/1)                                                          
                    Super-CLEVR  100.00%         (3/3)                                                          
                    TQA          100.00%         (1/1)                                                          
                    ScienceQA    100.00%         (1/1)                                                          
                    GeoQA+        66.67%         (2/3)                                                          
                    SciBench       0.00%         (0/1)                                                          
                    TextVQA        0.00%         (0/2)                                                          
                    CLEVR-Math     0.00%         (0/2)                                                          
                    Geometry3K     0.00%         (0/1)                                                          
                    IconQA         0.00%         (0/1)                                                          
                    IQTest         0.00%         (0/1)                                                          
                    DVQA           0.00%         (0/2)                                                          
                    ChartQA        0.00%         (0/1)                                                          
                                                                                                                
                    category                                                                                    
                    ========================================                                                    
                                      Accuracy Correct/Total                                                    
                    general-vqa         50.00%        (5/10)                                                    
                    math-targeted-vqa   30.00%        (3/10)                                                    
                                                                                                                
                    task                                                                                        
                    ========================================                                                    
                                                Accuracy Correct/Total                                          
                    textbook question answering   66.67%         (2/3)                                          
                    visual question answering     60.00%         (3/5)                                          
                    geometry problem solving      60.00%         (3/5)                                          
                    math word problem              0.00%         (0/3)                                          
                    figure question answering      0.00%         (0/4)                                          
                                                                                                                
                    context                                                                                     
                    ========================================                                                    
                                      Accuracy Correct/Total                                                    
                    geometry diagram    60.00%         (3/5)                                                    
                    synthetic scene     60.00%         (3/5)                                                    
                    scientific figure   50.00%         (1/2)                                                    
                    natural image       33.33%         (1/3)                                                    
                    abstract scene       0.00%         (0/1)                                                    
                    puzzle test          0.00%         (0/1)                                                    
                    bar chart            0.00%         (0/3)                                                    
                                                                                                                
                    grade                                                                                       
                    ========================================                                                    
                                      Accuracy Correct/Total                                                    
                    high school         66.67%         (4/6)                                                    
                    daily life          37.50%         (3/8)                                                    
                    elementary school   20.00%         (1/5)                                                    
                    college              0.00%         (0/1)                                                    
                                                                                                                
                    skills                                                                                      
                    ========================================                                                    
                                          Accuracy Correct/Total                                                
                    scientific reasoning    66.67%         (2/3)                                                
                    algebraic reasoning     60.00%         (3/5)                                                
                    geometry reasoning      50.00%         (3/6)                                                
                    arithmetic reasoning    42.86%         (3/7)                                                
                    statistical reasoning    0.00%         (0/3)                                                
                    numeric commonsense      0.00%         (0/3)                                                
                    logical reasoning        0.00%         (0/1)                                                
                                                                                                                
[18:21:25] INFO     [root] Saved scores to: _results/eval/mathvista/gpt4/debug/gpt4_metric.json                 
[18:21:25] INFO     [root] MathVista: Calculating Scores - Finish
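
The FutureWarning in the output above comes from indexing a pandas Series with an integer position when its index is not positional. A minimal sketch of the change the warning suggests, using a small made-up DataFrame (the column name `true_false` and the index values are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in for the results table: the index holds problem ids as
# strings, so integer keys are positions, not labels.
results_df = pd.DataFrame(
    {"true_false": [True, False, True]},
    index=["12", "7", "301"],
)
key = "true_false"

# Positional access via .iloc avoids the deprecated Series[int] lookup.
values = 0
for i in range(len(results_df)):
    values += results_df[key].iloc[i]

# The same count without an explicit loop.
vectorized = int(results_df[key].sum())
```

Both forms count the correct answers; the vectorized sum avoids per-row indexing altogether.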

mattmazzola added 30 commits February 12, 2024 17:02

Add originals

46b4818

Add pyproject to control formatting

cbc0f99

Add more originals

9742258

Update utils

e07b557

Remove duplicate GPT from util and format

d13937b

Update GPT models

7766da3

Update generate_response

2b34cea

Update extract_answer

f1ce740

Update calculate score

ee2c283

Update extract to save after loop exit

0253565

Add save_every to generate_response

428eca6

Add llava model with debugging of images, and use HF dataset

6e9aaba

Merge branch 'main' into mattm/willow

ae27f29

add max_num_problems, HF dataset loading, and logging to files

ead3afe

Add query string formatting

8edab4e

Add editorconfig, gitattributes, and flake8

c9fea23

Add .devcontainer

3350606

Update .gitignore

71bbf03

Add pyproject.toml

4402408

Update devcontainer

3021ffa

Remove original copies

71a5508

format imports

5e32f58

Add levenshtein package

2fdb22c

Add debug configs

f4935be

Update llava with seed arg

2dd342e

Update GPT to use images if specified

ce7a297

format file

15e8e10

Update generate_responses

caa663b

Update extract answer

791e8d9

REmove unused values

f5549cb

mattmazzola added 7 commits March 3, 2024 18:25

Remove original

f4d1225

Update calculate score

6263255

move import

6cc5cb4

Remove combined metrics function

d4a9f90

Remove llava

47490f4

Add note about requirements.txt

54301ce

Update log

a1a95ae
