Add .devcontainer, update GPT to use OpenAI >1.x, make Claude and Bard imports dynamic and optional, use HuggingFace datasets #22

Open
wants to merge 37 commits into
base: main

mattmazzola
Contributor

@mattmazzola mattmazzola commented Mar 3, 2024

I had been working more closely with this repo a few weeks ago and thought I would try to contribute some of the modifications back for others to benefit.

Issues

  1. The installation and setup of the repo weren't explicitly specified. See How to use generate_responses.py? (References non-existing utilities module) #13 (comment)
  2. The code in the repo was still set up to use locally downloaded data, but the data is now available on HuggingFace
    1. The HuggingFace dataset has all splits, is easier to manage, and abstracts the problem away from the developer
  3. The gpt.py model was using an old version of the openai library
  4. The Bard and Claude libraries were supposed to be optional but were not

Solutions

  1. Use .devcontainer to standardize the development environment and dependency installation
  2. Change the evaluation files to all use the dataset from HuggingFace
  3. Update the GPT file to use the newer OpenAI library, with environment variables for Azure OpenAI
  4. Import claude, openai, and bard dynamically, only when that model type is chosen
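
As a rough illustration of solution 4, a dynamic import can be deferred until the model type is known. This is only a sketch: the `MODEL_MODULES` registry, the module paths in it, and the `load_model_module` helper are hypothetical names chosen for the example and may not match the code in this PR.

```python
import importlib
from types import ModuleType
from typing import Mapping

# Hypothetical registry: model type -> module path. The real repo layout may differ.
MODEL_MODULES = {
    "gpt": "models.gpt",
    "claude": "models.claude",
    "bard": "models.bard",
}


def load_model_module(
    model_type: str, registry: Mapping[str, str] = MODEL_MODULES
) -> ModuleType:
    """Import the wrapper module for model_type only when it is requested."""
    if model_type not in registry:
        raise ValueError(f"Unknown model type: {model_type!r}")
    module_path = registry[model_type]
    try:
        # Nothing is imported at file load time, so a missing optional
        # dependency only matters if this model type is actually selected.
        return importlib.import_module(module_path)
    except ImportError as exc:
        raise ImportError(
            f"Model type {model_type!r} needs module {module_path!r}, "
            "which could not be imported. Install its optional dependency first."
        ) from exc
```

With this shape, a user who only runs GPT never needs the Claude or Bard packages installed, and a user who picks a model whose package is missing gets an error that names the module.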

Other Misc

  1. Use proper logging with rich formatting
  2. Separate metrics calculation from logging
  3. Use pandas DataFrame to print metrics in nicer tables (see below)
  4. Add the ability to limit all steps of evaluation (generate, extract, calculate) to a maximum number of problems
    1. Allows easier testing of functionality on small subsets
  5. Remove duplicate definitions of get_chat_response (Fixes: Redundant implementations of get_chat_response #16)
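
The problem limit in item 4 could look roughly like the sketch below. The function name `limit_problems` and the parameter `max_num_problems` are assumptions made for illustration, not necessarily the names used in this PR; only the warning text mirrors the sample output.

```python
import logging
from typing import Iterable, List, Optional


def limit_problems(
    problem_ids: Iterable[str], max_num_problems: Optional[int] = None
) -> List[str]:
    """Return at most max_num_problems ids, preserving order. None means no limit."""
    ids = list(problem_ids)
    if max_num_problems is None:
        return ids
    if max_num_problems < 0:
        raise ValueError("max_num_problems must be non-negative")
    if max_num_problems >= len(ids):
        return ids
    # Warn so that a truncated run is never mistaken for a full evaluation.
    logging.warning("Limiting number of problems to %d.", max_num_problems)
    return ids[:max_num_problems]
```

Applying the same helper in the generate, extract, and calculate steps keeps all three operating on the same subset.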

Sample Output

Generate Responses

[18:09:52] INFO     [root] MathVista: Generating Responses - Start                                              
[18:09:52] INFO     [root] Loading dataset AI4Math/MathVista, split testmini...                                 
[18:10:01] INFO     [root] Creating new query...                                                                
[18:10:01] INFO     [root] Loading gpt-4-32k...                                                                 
[18:10:01] INFO     [root] Model loaded.                                                                        
[18:10:01] INFO     [root] Results already exist.                                                               
[18:10:01] INFO     [root] Reading _results/eval/mathvista/gpt4/debug/gpt4.json...                              
[18:10:01] WARNING  [root] Limiting number of problems to 20.                                                   
[18:10:01] INFO     [root] Number of test problems to run: 20                                                   
  0%|                                                                                    | 0/20 [00:00<?, ?it/s][18:10:01] DEBUG    [root] --------------------------------------------------------------                       
[18:10:01] DEBUG    [root] Generating response for problem: 1...                                            
[18:10:14] DEBUG    [root] Query:                                                                               
                    Question: When a spring does work on an object, we cannot find the work by simply           
                    multiplying the spring force by the object's displacement. The reason is that there is no   
                    one value for the force-it changes. However, we can split the displacement up into an       
                    infinite number of tiny parts and then approximate the force in each as being constant.     
                    Integration sums the work done in all those parts. Here we use the generic result of the    
                    integration.                                                                                
                                                                                                                
                    In Figure, a cumin canister of mass $m=0.40 \mathrm{~kg}$ slides across a horizontal        
                    frictionless counter with speed $v=0.50 \mathrm{~m} / \mathrm{s}$. It then runs into and    
                    compresses a spring of spring constant $k=750 \mathrm{~N} / \mathrm{m}$. When the canister  
                    is momentarily stopped by the spring, by what distance $d$ is the spring compressed?        
                    Hint: Please answer the question requiring a floating-point number with one decimal place   
                    and provide the final value, e.g., 1.2, 1.3, 1.4, at the end.                               
                    Solution:                                                                                   
[18:10:14] DEBUG    [root] Response:                                                                            
                    The spring does work on the canister, bringing it to rest. The work done by the spring is   
                    equal to the kinetic energy of the canister before it hits the spring. The work done by the 
                    spring is given by the equation $W = \frac{1}{2}kx^2$, where $x$ is the distance the spring 
                    is compressed. The kinetic energy of the canister is given by the equation $KE =            
                    \frac{1}{2}mv^2$. Setting these two equal to each other gives:                              
                                                                                                                
                    $\frac{1}{2}kx^2 = \frac{1}{2}mv^2$                                                         
                                                                                                                
                    Solving for $x$ gives:                                                                      
                                                                                                                
                    $x = \sqrt{\frac{mv^2}{k}}$                                                                 
                                                                                                                
                    Substituting the given values gives:                                                        
                                                                                                                
                    $x = \sqrt{\frac{(0.40 \mathrm{~kg})(0.50 \mathrm{~m/s})^2}{750 \mathrm{~N/m}}}$            
                                                                                                                
                    $x = 0.01 \mathrm{~m}$                                                                      
                                                                                                                
                    So, the spring is compressed by a distance of 0.01 m or 1.0 cm.                             
  5%|███▊                                                                        | 1/20 [00:13<04:08, 13.05s/it][18:10:14] DEBUG    [root] --------------------------------------------------------------                       
[18:10:14] DEBUG    [root] Generating response for problem: 2...      
...
[18:11:18] DEBUG    [root] Query:                                                                               
                    Question: Is the sum of smallest two bar is greater then the largest bar?                   
                    Choices:                                                                                    
                    (A) Yes                                                                                     
                    (B) No                                                                                      
                    Hint: Please answer the question and provide the correct option letter, e.g., A, B, C, D, at
                    the end.                                                                                    
                    Solution:                                                                                   
[18:11:18] DEBUG    [root] Response:                                                                            
                    The question does not provide enough information for a solution. It refers to "bars" but    
                    does not specify their sizes or quantities.                                                 
[18:11:18] INFO     [root] Saved results to _results/eval/mathvista/gpt4/debug/gpt4.json                        
100%|███████████████████████████████████████████████████████████████████████████| 20/20 [01:17<00:00,  3.89s/it]
[18:11:18] INFO     [root] MathVista: Generating Responses - Finish  

Extract Answer

[18:16:09] INFO     [root] MathVista: Extract Answers - Start                                                   
[18:16:09] INFO     [root] Reading _results/eval/mathvista/gpt4/debug/gpt4.json...                              
[18:16:09] INFO     [root] Number of test problems to run: 20                                                   
 95%|███████████████████████████████████████████████████████████████████████▎   | 19/20 [00:35<00:02,  2.91s/it][18:16:46] INFO     [root] Saved results to _results/eval/mathvista/gpt4/debug/gpt4.json                        
100%|███████████████████████████████████████████████████████████████████████████| 20/20 [00:37<00:00,  1.86s/it]
[18:16:46] INFO     [root] MathVista: Extract Answers - Finish

Calculate Score

[18:21:17] INFO     [root] MathVista: Calculating Scores - Start                                                
[18:21:17] INFO     [root] Loading dataset AI4Math/MathVista, split testmini...                                 
[18:21:25] INFO     [root] Reading _results/eval/mathvista/gpt4/debug/gpt4.json...                              
[18:21:25] INFO     [root] Number of testing problems: 20                                                       
[18:21:25] INFO     [root] For each problem normalize extractions and get True False value                      
100%|████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 34735.44it/s]
[18:21:25] INFO     [root] Calculate the average accuracy                                                       
100%|███████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 353949.70it/s]
/workspaces/MathVista/evaluation/calculate_score.py:249: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  values += results_df[key][i]
[18:21:25] INFO     [root] Correct: 8/20 - Accuracy: 40.00%                                                     
                    ========================================                                                    
                                                                                                                
                    question_type                                                                               
                    ========================================                                                    
                                 Accuracy Correct/Total                                                         
                    multi_choice   61.54%        (8/13)                                                         
                    free_form       0.00%         (0/7)                                                         
                                                                                                                
                    answer_type                                                                                 
                    ========================================                                                    
                            Accuracy Correct/Total                                                              
                    text      61.54%        (8/13)                                                              
                    float      0.00%         (0/1)                                                              
                    integer    0.00%         (0/6)                                                              
                                                                                                                
                    language                                                                                    
                    ========================================                                                    
                            Accuracy Correct/Total                                                              
                    chinese   66.67%         (2/3)                                                              
                    english   35.29%        (6/17)                                                              
                                                                                                                
                    source                                                                                      
                    ========================================                                                    
                                Accuracy Correct/Total                                                          
                    UniGeo       100.00%         (1/1)                                                          
                    Super-CLEVR  100.00%         (3/3)                                                          
                    TQA          100.00%         (1/1)                                                          
                    ScienceQA    100.00%         (1/1)                                                          
                    GeoQA+        66.67%         (2/3)                                                          
                    SciBench       0.00%         (0/1)                                                          
                    TextVQA        0.00%         (0/2)                                                          
                    CLEVR-Math     0.00%         (0/2)                                                          
                    Geometry3K     0.00%         (0/1)                                                          
                    IconQA         0.00%         (0/1)                                                          
                    IQTest         0.00%         (0/1)                                                          
                    DVQA           0.00%         (0/2)                                                          
                    ChartQA        0.00%         (0/1)                                                          
                                                                                                                
                    category                                                                                    
                    ========================================                                                    
                                      Accuracy Correct/Total                                                    
                    general-vqa         50.00%        (5/10)                                                    
                    math-targeted-vqa   30.00%        (3/10)                                                    
                                                                                                                
                    task                                                                                        
                    ========================================                                                    
                                                Accuracy Correct/Total                                          
                    textbook question answering   66.67%         (2/3)                                          
                    visual question answering     60.00%         (3/5)                                          
                    geometry problem solving      60.00%         (3/5)                                          
                    math word problem              0.00%         (0/3)                                          
                    figure question answering      0.00%         (0/4)                                          
                                                                                                                
                    context                                                                                     
                    ========================================                                                    
                                      Accuracy Correct/Total                                                    
                    geometry diagram    60.00%         (3/5)                                                    
                    synthetic scene     60.00%         (3/5)                                                    
                    scientific figure   50.00%         (1/2)                                                    
                    natural image       33.33%         (1/3)                                                    
                    abstract scene       0.00%         (0/1)                                                    
                    puzzle test          0.00%         (0/1)                                                    
                    bar chart            0.00%         (0/3)                                                    
                                                                                                                
                    grade                                                                                       
                    ========================================                                                    
                                      Accuracy Correct/Total                                                    
                    high school         66.67%         (4/6)                                                    
                    daily life          37.50%         (3/8)                                                    
                    elementary school   20.00%         (1/5)                                                    
                    college              0.00%         (0/1)                                                    
                                                                                                                
                    skills                                                                                      
                    ========================================                                                    
                                          Accuracy Correct/Total                                                
                    scientific reasoning    66.67%         (2/3)                                                
                    algebraic reasoning     60.00%         (3/5)                                                
                    geometry reasoning      50.00%         (3/6)                                                
                    arithmetic reasoning    42.86%         (3/7)                                                
                    statistical reasoning    0.00%         (0/3)                                                
                    numeric commonsense      0.00%         (0/3)                                                
                    logical reasoning        0.00%         (0/1)                                                
                                                                                                                
[18:21:25] INFO     [root] Saved scores to: _results/eval/mathvista/gpt4/debug/gpt4_metric.json                 
[18:21:25] INFO     [root] MathVista: Calculating Scores - Finish
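
The per-group rows in the tables above follow one pattern: count correct and total per group, then format the ratio. A standard-library sketch of that calculation, kept separate from any logging, is shown here. The record layout and the `true_false` field name are assumptions for illustration, and the sketch only covers keys with a single value per problem; the PR itself uses pandas.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Tuple


def accuracy_by_key(
    records: Iterable[Mapping], key: str
) -> Dict[str, Tuple[int, int, float]]:
    """Group records by record[key]; return {value: (correct, total, accuracy_pct)}."""
    counts = defaultdict(lambda: [0, 0])  # value -> [correct, total]
    for record in records:
        bucket = counts[record[key]]
        bucket[0] += int(bool(record["true_false"]))  # hypothetical field name
        bucket[1] += 1
    return {
        value: (correct, total, 100.0 * correct / total)
        for value, (correct, total) in counts.items()
    }


def format_row(name: str, correct: int, total: int, accuracy: float) -> str:
    """Render one table row, e.g. 'multi_choice  61.54%  (8/13)'."""
    return f"{name}  {accuracy:.2f}%  ({correct}/{total})"
```

Because the calculation returns plain values, the caller can decide whether to log them, print them as a table, or write them to the metrics JSON file.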
