Gpt math solver #991

Draft
wants to merge 136 commits into
base: main
Changes from 49 commits
Commits
7dc4035
handle format error in message in _construct_params
yiranwu0 Apr 11, 2023
83ff983
fix typo
yiranwu0 Apr 12, 2023
ab2cada
Add math solver with automatic tool queries.
yiranwu0 Apr 16, 2023
2d70c99
add imports in QueryHandler
yiranwu0 Apr 16, 2023
c823cbf
update math solver
yiranwu0 Apr 23, 2023
766b022
require wolfram id in readme
yiranwu0 Apr 23, 2023
84ba0be
Merge branch 'main' into gpt_math_solver
yiranwu0 Apr 23, 2023
8f67ed7
fix bug in running python code
yiranwu0 Apr 23, 2023
a511f0a
Update flaml/autogen/math_solver/MathSolver.py
yiranwu0 Apr 23, 2023
87ad79d
Update flaml/autogen/math_solver/README.md
yiranwu0 Apr 23, 2023
a16fa5f
revise according to comments
yiranwu0 Apr 23, 2023
e21fd76
Merge branch 'gpt_math_solver' of github.com:kevin666aa/FLAML into gp…
yiranwu0 Apr 23, 2023
45dcb7f
fix code format
yiranwu0 Apr 23, 2023
435e7a4
Add prompt to system message
yiranwu0 Apr 23, 2023
d1747cf
refrtor file names
yiranwu0 Apr 24, 2023
56627a7
refine prompts
yiranwu0 Apr 24, 2023
9821820
add baseline PoT
yiranwu0 Apr 24, 2023
e37ee3e
fix bugs in query_handler
yiranwu0 Apr 24, 2023
5d44e5e
refine prompts
yiranwu0 Apr 24, 2023
bab2878
refine prompt to output fractions
yiranwu0 Apr 24, 2023
d0b0d4b
change prompt
yiranwu0 Apr 24, 2023
3e171a3
add temperature as args
yiranwu0 Apr 24, 2023
2261c5c
fix concat float to str
yiranwu0 Apr 24, 2023
8c5a86c
change prompt back to use fractions instead of decimal
yiranwu0 Apr 24, 2023
2b8b717
rewind prompt back to e37ee3
yiranwu0 Apr 25, 2023
8b68ff7
pass args.samples_per_category in PoT
yiranwu0 Apr 25, 2023
54407a7
fix counting bug in PoT and print in mth_solver
yiranwu0 Apr 25, 2023
4806631
fix error: convet exception to str
yiranwu0 Apr 25, 2023
80a7063
add logger to log stdouts and compress files
yiranwu0 Apr 25, 2023
d737644
refine logging
yiranwu0 Apr 25, 2023
d146e35
add option to put prompt in either system or user message, add option…
yiranwu0 Apr 26, 2023
26c0caa
clean up main.py
yiranwu0 Apr 26, 2023
2a1a47e
create pseudo_main.py
yiranwu0 Apr 26, 2023
edfc679
fix category loading bug
yiranwu0 Apr 26, 2023
6a15761
handle timeout
yiranwu0 Apr 26, 2023
ab64723
two new prompts
yiranwu0 Apr 26, 2023
f723a8f
add bash
yiranwu0 Apr 27, 2023
1a5c93c
more prompts
yiranwu0 Apr 27, 2023
955edca
change run sequence
yiranwu0 Apr 27, 2023
8519967
add more prompts
yiranwu0 Apr 28, 2023
912193e
catch wolfram error
yiranwu0 Apr 28, 2023
c8f90b4
more runs on v2.1 select, v1.2 select, add new v3select
yiranwu0 Apr 28, 2023
7a8c2ac
compress when all finished
yiranwu0 Apr 28, 2023
b9a7e04
py exec output fix
yiranwu0 Apr 28, 2023
65f1580
v3.1 select
yiranwu0 Apr 29, 2023
73088ce
new both prompt, v3.2select
yiranwu0 Apr 29, 2023
144c148
change execute to run
yiranwu0 Apr 29, 2023
812477a
refine query handling and v3.3select
yiranwu0 Apr 30, 2023
25e2708
catch wolfram errors
yiranwu0 Apr 30, 2023
1c00283
ablation on only using python and zeroshot baseline
yiranwu0 May 1, 2023
1330a00
change run sequence
yiranwu0 May 1, 2023
e61212f
new run
yiranwu0 May 1, 2023
2b5dd52
new run
yiranwu0 May 1, 2023
ac11d2a
consitent ouput folder in PoT
yiranwu0 May 1, 2023
9d291b9
1erun pot , refined prompt v1.3 v1.4 and v3.4
yiranwu0 May 2, 2023
ce7144a
resume 22 if not finished
yiranwu0 May 2, 2023
6fefde3
handle wolfram exception
yiranwu0 May 2, 2023
eaae6ce
one run for v1.5
yiranwu0 May 2, 2023
8fdf74f
one run for v1.5 corrections
yiranwu0 May 2, 2023
ca75c91
two more prompts v3.5select and v3.1python based on v3python
yiranwu0 May 3, 2023
47179ce
remove error string clipping
yiranwu0 May 3, 2023
a8c3758
handle UnicodeDecodeError
yiranwu0 May 3, 2023
132638a
handle UnicodeDecodeError
yiranwu0 May 3, 2023
280f9de
quick test on adding wolfram to v3.1python
yiranwu0 May 3, 2023
45a4abd
rerun v3.1 with refine, add v3.7select to further test wolfram
yiranwu0 May 4, 2023
b0efcbf
switch run seq v3.7select then v3.1python
yiranwu0 May 4, 2023
10c28ae
add v3.2python, slightly refine from v3.1. try out v3.3python
yiranwu0 May 4, 2023
bfe61aa
more args for PoT and refine load_leve5 func
yiranwu0 May 5, 2023
39ea367
trial 38-42: validate our methods on all level of problems, run large…
yiranwu0 May 5, 2023
0cebecb
update run.sh
yiranwu0 May 5, 2023
bddb610
move
sonichi May 6, 2023
326da82
add v4
yiranwu0 May 7, 2023
bd040b5
Merge branch 'gpt_math_solver' of github.com:kevin666aa/FLAML into gp…
yiranwu0 May 7, 2023
c8ba447
test with new system message
yiranwu0 May 7, 2023
62b5259
add baseline pnas, run v4.2 on level5 problems, test new sys message …
yiranwu0 May 8, 2023
ef509d4
fix trial 49
yiranwu0 May 8, 2023
e60850f
remove print
yiranwu0 May 8, 2023
5fe0b0b
run v3 with specified sentence removed, 4.2 with original sys message…
yiranwu0 May 9, 2023
d92b559
remove trial 52
yiranwu0 May 9, 2023
ede98a5
endpoint
sonichi May 9, 2023
7082355
Merge branch 'gpt_math_solver' of https://github.com/kevin666aa/FLAML…
sonichi May 9, 2023
7d34485
fix bug in queryhandler
yiranwu0 May 9, 2023
8e34218
Merge branch 'gpt_math_solver' of github.com:kevin666aa/FLAML into gp…
yiranwu0 May 9, 2023
c592837
fix queryhandler 2
yiranwu0 May 9, 2023
6345e0b
v3.3python
yiranwu0 May 10, 2023
40fd299
remove print
yiranwu0 May 10, 2023
dac1551
test final prompts
yiranwu0 May 11, 2023
fff4e4b
change run sequence
yiranwu0 May 11, 2023
da0f7d9
run exact v3.1 as before
yiranwu0 May 11, 2023
2775e08
keep runing v3.1python and add general_5
yiranwu0 May 11, 2023
ad10b71
add general_5
yiranwu0 May 11, 2023
7800a46
continue run 55 and 56
yiranwu0 May 12, 2023
4f78539
switch seq
yiranwu0 May 12, 2023
a76113f
trial 63 v3.5python, then run large-scale with v3.3python
yiranwu0 May 12, 2023
7d22c07
add v3.3, 3.7, 3.8
yiranwu0 May 13, 2023
908d283
revise 3.6-3.8
yiranwu0 May 13, 2023
079b4e2
v3.9
yiranwu0 May 13, 2023
f071214
test interalge and precal on v3.9
yiranwu0 May 13, 2023
6444f91
test v3.9 on 50 problems, then zero shot
yiranwu0 May 13, 2023
1c4a278
fix prompt
yiranwu0 May 13, 2023
c744613
endpoint
sonichi May 13, 2023
2ad469f
Merge branch 'gpt_math_solver' of https://github.com/kevin666aa/FLAML…
sonichi May 13, 2023
b806dfb
run all problems on v3.9, and pnas
yiranwu0 May 13, 2023
3733028
endpoint
sonichi May 13, 2023
cbd0be0
Merge remote-tracking branch 'upstream/main' into gpt_math_solver
May 15, 2023
89d7512
run v1python
yiranwu0 May 15, 2023
3791326
Merge branch 'gpt_math_solver' of github.com:kevin666aa/FLAML into gp…
yiranwu0 May 15, 2023
2a6ffa1
run v1python+wolfram
yiranwu0 May 16, 2023
d833938
run pot with sys message
yiranwu0 May 19, 2023
bf73756
endpoint
sonichi May 19, 2023
f1b3873
Merge branch 'gpt_math_solver' of https://github.com/kevin666aa/FLAML…
sonichi May 19, 2023
e3d8de1
run pot with system message
yiranwu0 May 19, 2023
d84213d
Merge branch 'gpt_math_solver' of https://github.com/kevin666aa/FLAML…
sonichi May 19, 2023
f8c68ff
Merge branch 'gpt_math_solver' of github.com:kevin666aa/FLAML into gp…
May 19, 2023
9bc17db
fewshot+zeroshot prompt
May 19, 2023
769803e
add assert
May 19, 2023
85d9b59
refine fewshot
yiranwu0 May 20, 2023
bce7f4f
run pre-commit
yiranwu0 May 20, 2023
59bc9f9
rerun v3.9 with cache and get token info
yiranwu0 May 21, 2023
32de58f
run PoT on all problems
yiranwu0 May 22, 2023
9dabf61
Merge remote-tracking branch 'upstream/main' into gpt_math_solver
yiranwu0 May 22, 2023
9c3efd4
merge new changes and update pot
yiranwu0 May 22, 2023
fc8bcdc
endpoint
sonichi May 22, 2023
c711143
Merge branch 'gpt_math_solver' of https://github.com/kevin666aa/FLAML…
sonichi May 22, 2023
f535e50
fix decode in PoT
yiranwu0 May 22, 2023
841ff2a
Merge branch 'gpt_math_solver' of github.com:kevin666aa/FLAML into gp…
yiranwu0 May 22, 2023
43d8277
clean up and rename
yiranwu0 May 27, 2023
01f7712
resolve conflict in setup
yiranwu0 May 27, 2023
d4d8242
Merge branch 'microsoft:main' into gpt_math_solver
yiranwu0 Jun 7, 2023
1cfce5f
clean up
yiranwu0 Jun 7, 2023
be7bb3d
update readme
yiranwu0 Jun 7, 2023
d3e8719
add mathchat flow hart
yiranwu0 Jun 7, 2023
2c8823f
Update README.md
yiranwu0 Jun 7, 2023
7808b4f
Merge branch 'microsoft:main' into gpt_math_solver
yiranwu0 Jun 10, 2023
348446b
add missing files
yiranwu0 Jul 10, 2023
c49ab9c
Merge branch 'gpt_math_solver' of github.com:kevin666aa/FLAML into gp…
yiranwu0 Jul 10, 2023
105 changes: 105 additions & 0 deletions flaml/autogen/math/README.md
@@ -0,0 +1,105 @@
# Math Solver

## Run

1. Set up env

```
pip install -e .[math]
```

2. In `main.py`, set your OpenAI API key and Wolfram Alpha app ID:

```
openai.api_key = "Your key here"
os.environ["WOLFRAM_ALPHA_APPID"] = "Your id here"
```

3. Test out `main.py` with `--test_run`

```
cd flaml/autogen/math
python main.py --prompt_type select --test_run --categories all
```

Arguments:

```
python main.py \
--prompt_type ['select', 'python', 'wolfram'] \
--max_round [default=15] \
--folder [default='./autotools'] \
--cache_folder [default='./cache'] \
--samples_per_category [default=20] \
--temperature [default=1, range[0,2]] \
--prompt_location [default='user', choose from ['user', 'system']] \
--categories [default=[0,1], list of category ids below or 'all' (meaning all 7 categories)] \
[--test_run] # test run
```



0 Algebra
1 Counting & Probability
2 Geometry
3 Intermediate Algebra
4 Number Theory
5 Prealgebra
6 Precalculus

4. Check the results in the saving folder set by `--folder` (default is './autotools').
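
The flags and category ids documented above could be declared roughly as follows. This is a minimal sketch, not the contents of `main.py`: `build_parser` and `resolve_categories` are hypothetical names, and the default for `--prompt_type` is an assumption because no default is stated here.

```python
import argparse

# Category ids as listed in this README.
CATEGORIES = {
    0: "Algebra",
    1: "Counting & Probability",
    2: "Geometry",
    3: "Intermediate Algebra",
    4: "Number Theory",
    5: "Prealgebra",
    6: "Precalculus",
}


def build_parser():
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser()
    # Assumption: the README gives no default for --prompt_type; 'select' is used here.
    parser.add_argument("--prompt_type", choices=["select", "python", "wolfram"], default="select")
    parser.add_argument("--max_round", type=int, default=15)
    parser.add_argument("--folder", default="./autotools")
    parser.add_argument("--cache_folder", default="./cache")
    parser.add_argument("--samples_per_category", type=int, default=20)
    parser.add_argument("--temperature", type=float, default=1.0)
    parser.add_argument("--prompt_location", choices=["user", "system"], default="user")
    parser.add_argument("--categories", nargs="+", default=["0", "1"])
    parser.add_argument("--test_run", action="store_true")
    return parser


def resolve_categories(values):
    """Turn ['all'] or a list of id strings into category names."""
    if list(values) == ["all"]:
        return list(CATEGORIES.values())
    return [CATEGORIES[int(v)] for v in values]
```

With this sketch, `--categories all` resolves to all 7 categories, and omitting the flag resolves to Algebra and Counting & Probability.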

### Baselines

1. Program of Thoughts (PoT)

```
cd flaml/autogen/math
python baselines/PoT.py
```

Arguments:

```
python baselines/PoT.py \
--folder [default='./PoT'] \
--cache_folder [default='.cache/PoT'] \
--samples_per_category [default=20] \
[--dry_run] # output prompt with one problem from each category and do not query openai
```

## Implementation

- `QueryHandler.py`:

  - Function `handle_query`:
    1. Parse all queries from an input string.
    2. Iterate over the queries and call Python or Wolfram for each one.
    3. Return all results and a boolean indicating whether all queries executed without error.

- `MathSolver.py`: Main solver using tools.

  - Settings:

    - `use_cache=True`
    - `oai.ChatCompletion.request_timeout = 60*10`
    - `max_round=15`: maximum number of message rounds allowed.
    - `len(query_response) < 2000`: a query response must be shorter than 2000 characters (around 600-1000 tokens). This prevents excessively long decimal output from Python code.
    - `max_invalid_q_per_step=3`: if we keep getting invalid results 3 times within one step, we ask the LLM to solve the query itself.

  - Function `make_conversation`: get a response from OpenAI, extract the queries from the response, and get their results.

    - The answer is valid if '\boxed{}' is detected.
    - The answer is invalid (an empty string is returned) if any of the following holds:
      - `max_round` (of conversation) is exceeded
      - the max token limit is exceeded (8192 for GPT-4)
      - the character count of a query reply exceeds 2000

  - Function `solve_one_category`: solve problems from one category.

    - Assumption 1: when called with a problem set, all problems are of the same type.
    - Assumption 2: when resuming from a previous run, the sequence of problems in a category is the same as in that run. This holds as long as the same shuffling seed is used.
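
The three `handle_query` steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the code in `QueryHandler.py`: the query format (one JSON object per line with `tool` and `query` keys) and the `tools` argument are hypothetical, since this README does not specify how queries are encoded or dispatched.

```python
import json


def handle_query(text, tools):
    """Parse queries from `text`, run each one, and report overall success.

    Assumed (hypothetical) query format: one JSON object per line, e.g.
    {"tool": "python", "query": "print(1 + 1)"}.
    `tools` maps a tool name to a callable that takes the query string.
    """
    # 1. Parse all queries in the input string.
    queries = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # plain text around the queries is ignored
        try:
            queries.append(json.loads(line))
        except json.JSONDecodeError:
            queries.append({"tool": None, "query": line})

    # 2. Iterate over the queries and call the matching tool.
    results, all_ok = [], True
    for q in queries:
        runner = tools.get(q.get("tool"))
        if runner is None:
            results.append(f"Error: unknown tool {q.get('tool')!r}")
            all_ok = False
            continue
        try:
            results.append(str(runner(q.get("query", ""))))
        except Exception as exc:  # a failing tool marks the batch as not error-free
            results.append(f"Error: {exc}")
            all_ok = False

    # 3. Return all results plus a flag for error-free execution.
    return results, all_ok
```

Passing the tools in as callables keeps the sketch independent of any real Python executor or Wolfram client; a caller would supply those.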

## Prompts

Three prompts are available: ['select', 'python', 'wolfram'].
'select' lets the model choose between the two tools, while 'python' and 'wolfram' each restrict it to that single tool.

Please see `math_solver.py` for the prompts.
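
As a small illustration of the prompt types, the mapping from prompt type to available tools can be written down directly. `PROMPT_TOOLS` and `allowed_tools` are hypothetical names used only for this sketch.

```python
# Tools each prompt type exposes to the model, as described in this section.
PROMPT_TOOLS = {
    "select": ("python", "wolfram"),
    "python": ("python",),
    "wolfram": ("wolfram",),
}


def allowed_tools(prompt_type):
    """Return the tool names available under `prompt_type` (hypothetical helper)."""
    if prompt_type not in PROMPT_TOOLS:
        raise ValueError(f"unknown prompt type: {prompt_type!r}")
    return PROMPT_TOOLS[prompt_type]
```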
File renamed without changes.
160 changes: 160 additions & 0 deletions flaml/autogen/math/baselines/PoT.py
@@ -0,0 +1,160 @@
# adapted from https://github.com/wenhuchen/Program-of-Thoughts/blob/main/run_gsm8k_zs.py
import openai
from time import sleep
from tool import synthesize_program
from collections import Counter
from datetime import datetime
from tqdm import tqdm
import os
import json
import argparse
from flaml import oai
import datasets

# Caution: the two imports below come from different modules (`math_utils` vs. `math.utils`)
from flaml.autogen.math_utils import eval_math_responses, get_answer
from flaml.autogen.math.utils import (
    load_level5_math_each_category,
    math_type_mapping,
    write_json,
    remove_asy_sections,
    mylogger,
)
from flaml.autogen.code_utils import execute_code


parser = argparse.ArgumentParser()
# parser.add_argument("--key", default='OPENAI_KEY', type=str)
parser.add_argument("--dry_run", default=False, action="store_true")
parser.add_argument("--folder", "-f", dest="folder", help="saving folder", default="./PoT", type=str)
parser.add_argument("--cache_folder", "-c", dest="cache_folder", default=".cache/PoT", help="cache folder")
parser.add_argument("--samples_per_category", "-s", help="samples per category", default=20, type=int)
args = parser.parse_args()

# key = os.getenv(args.key)
# print(key)


def PoT_solve(model, problem, max_tokens=None):
    commented_problem = problem["problem"].replace("\n", "\n# ")  # in case the problem is multiline
    commented_problem = remove_asy_sections(commented_problem)
    full_prompt = f"""
import math
import numpy as np
import sympy as sp # added

# Question: {commented_problem}
# Answer this question by implementing a solver() function.
def solver():
    # Let's write a Python program step by step, and then return the answer
    # Firstly, we need define the following variable:
"""
    with open(os.path.join(args.folder, "prompt.txt"), "w") as f:
        f.write(full_prompt)
    if args.dry_run:
        print(full_prompt)
        print("=======================")
        return

    config = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": full_prompt},
        ],
        "n": 1,
    }
    if max_tokens is not None:
        config["max_tokens"] = max_tokens

    raw_responses = oai.ChatCompletion.create(None, **config, use_cache=True)
    responses = oai.ChatCompletion.extract_text(raw_responses)

    # TODO: adapt for voting
    program = synthesize_program(responses[0], full_prompt)
    return_code, ans = execute_code(program, timeout=5, use_docker=False)
    ans = ans.decode("ascii").strip() if isinstance(ans, bytes) else ans
    ans = "Error" if return_code != 0 or ans is None else ans
    response_with_ans = "\\boxed{" + str(ans) + "}"

    prompt_price = (
        oai.ChatCompletion.price1K[model][0]
        if isinstance(oai.ChatCompletion.price1K[model], tuple)
        else oai.ChatCompletion.price1K[model]
    )
    return {
        "cost": oai.ChatCompletion.cost(model, raw_responses),
        "prompt_cost": prompt_price * raw_responses["usage"]["prompt_tokens"] / 1000,
        "response_with_ans": response_with_ans,
        "program": program,
    }


if __name__ == "__main__":
    oai.ChatCompletion.request_timeout = 60 * 10  # 10 minutes
    oai.ChatCompletion.set_cache(seed=41, cache_path=args.cache_folder)

    os.makedirs(args.folder, exist_ok=True)
    logger = mylogger(os.path.join(args.folder, "log.txt"))

    engine = "gpt-4"
    aggre_correct = 0
    problem_sets = load_level5_math_each_category(samples_per_category=args.samples_per_category)
    logger.log("problem id: is_correct $ ans $ correct_ans $ accum_acc", verbose=True)

    for problem_set in problem_sets:  # one problem_set is one category
        for i in range(len(problem_set)):
            problem_set[i]["problem_id"] = str(i)  # assign problem id

        logger.log("Solving " + problem_set[0]["type"], verbose=True)
        saving_folder = os.path.join(args.folder, math_type_mapping[problem_set[0]["type"]])
        os.makedirs(saving_folder, exist_ok=True)
        done_problems = {int(f.split(".")[0]) for f in os.listdir(saving_folder) if "json" in f}

        correct_counts = 0
        for count, problem in enumerate(problem_set):
            problem_path = os.path.join(saving_folder, problem["problem_id"] + ".json")

            # 1. if problem already solved, continue
            if int(problem["problem_id"]) in done_problems:
                with open(problem_path, "r") as f:
                    problem = json.load(f)
                aggre_correct += problem["is_correct"]
                correct_counts += problem["is_correct"]
                logger.log(
                    f"{count}: {problem['is_correct']} $ {problem['voted_answer']} $ {problem['correct_ans']} $ {round(correct_counts / (count + 1), 4)} (loaded from previous run)",
                    verbose=True,
                )
                continue

            results = PoT_solve(engine, problem)
            if args.dry_run:
                # In a dry run PoT_solve only prints the prompt and returns None,
                # so there is nothing to evaluate; move on to the next category.
                break
            metrics = eval_math_responses([results["response_with_ans"]], problem["solution"])
            aggre_correct += metrics["success_vote"]
            correct_counts += metrics["success_vote"]

            problem.update(
                {
                    "cost": results["cost"],
                    "is_correct": bool(metrics["success_vote"]),
                    "correct_ans": get_answer(problem["solution"]),
                    "voted_answer": get_answer(metrics["voted_answer"]),
                    "program": results["program"],
                }
            )
            write_json(problem, problem_path)
            logger.log(
                f"{count}: {problem['is_correct']} $ {problem['voted_answer']} $ {problem['correct_ans']}",
                verbose=True,
            )
        logger.log(
            f"{problem_set[0]['type']} acc: {correct_counts}/{len(problem_set)}= {round(correct_counts / len(problem_set), 4)}",
        )
        logger.log("-----------------------------------")
        os.system("tar -czf " + args.folder + ".tar.gz " + args.folder)

    logger.log(
        f"Total accuracy: {aggre_correct}/{(len(problem_sets) * len(problem_sets[0]))}={round(aggre_correct / (len(problem_sets) * len(problem_sets[0])), 4)}",
    )
    logger.log("****************************\n\n\n\n")
    os.system("tar -czf " + args.folder + ".tar.gz " + args.folder)