chore: prompt engineering to enhance output stability #119
Conversation
thanks for pushing this 😄 left some thoughts
```
{{
  "type": "unsupported",
  "message": "Reason why the prompt is unsupported here"
}}

Please note: You can only output content in json format, and do not output any other content!
```
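(For context on the snippet above, here is a minimal sketch of how a consumer might enforce that JSON-only contract when reading the model's reply; the function name and the brace-extraction fallback are illustrative assumptions, not code from this PR:)

```python
import json
from typing import Any

def parse_model_output(raw: str) -> dict[str, Any]:
    """Parse the model's reply, tolerating stray prose around the JSON.

    The prompt demands JSON only, but small models sometimes wrap the
    object in extra text, so we fall back to slicing out the outermost
    braces before giving up.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        start, end = raw.find("{"), raw.rfind("}")
        if start == -1 or end == -1:
            raise ValueError(f"No JSON object found in model output: {raw!r}")
        return json.loads(raw[start : end + 1])
```

A caller would then branch on `result.get("type") == "unsupported"` to surface the message to the user.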
this line is repeated in line 121 - is that intended?
Yes, we have found that for small parameter models, emphasizing the most important content at both the beginning and the end of a long prompt seems to help the stability of the output.
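(A minimal sketch of the technique described above, assuming a simple string-template prompt builder; the names are illustrative, not the project's actual code:)

```python
def build_prompt(task: str, examples: str, json_instruction: str) -> str:
    """Sandwich the critical output-format rule around the long prompt body.

    Small models tend to lose instructions buried in the middle of long
    contexts, so the most important rule is stated first and restated last.
    """
    return "\n\n".join([
        json_instruction,   # stated up front
        task,
        examples,           # the long middle section
        json_instruction,   # restated at the end for emphasis
    ])
```

The duplication costs a few tokens, but for small models it reduces the chance that the format rule gets ignored.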
Alternatively, could we use different prompts for strong reasoning models such as GPT-4 versus small parameter models that need requirements constantly re-emphasized? It's true that a long prompt may not be necessary for GPT-4, which can already handle this well.
i think it makes sense to have one prompt for gpt-4 and another for small parameter models
based on the benchmark results from #119 (review), i would recommend updating the examples with the changes i suggested and seeing if they improve the planning (which they should, since the examples right now are incorrect). if that does not improve things, then use different prompts for the different models 😄 otherwise, i don't think it is necessary to have different prompts for the models in this PR
Yes, for right now I think we should have a single prompt. Adding more will just add another dimension of complexity, which will make it harder to get consistent results.
Yes, I agree with you. Since GPT-4 can already handle the requirements with concise prompts, more examples may only consume tokens. I think two different sets of prompts could be submitted to adapt to GPT-4 and our local LLMs.
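(If two prompt sets were adopted, the selection logic could be as small as the following sketch; the constants and model list here are hypothetical, and nothing like this was merged in this PR:)

```python
CONCISE_PROMPT = "..."  # short prompt for strong models like GPT-4
VERBOSE_PROMPT = "..."  # long prompt with repeated instructions for small models

# Hypothetical set of models considered strong enough for the concise prompt
STRONG_MODELS = {"gpt-4", "gpt-4-turbo"}

def select_system_prompt(model_name: str) -> str:
    """Pick the concise prompt for strong models, the verbose one otherwise."""
    return CONCISE_PROMPT if model_name in STRONG_MODELS else VERBOSE_PROMPT
```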
After benchmarking this locally via `python benchmarks.py ./autotx/tests/token/test_swap.py::test_auto_tx_swap_multiple 10`, I found that it performed worse than main does currently (NOTE: I pushed a fix for this test to main; it needs to be merged into this branch).
While main passed 8/10 times, with failures mainly relating to bug #112, this branch passes 6/10 times, with the primary problem being incorrect planning. To illustrate this, I'll paste a good plan and one of the bad plans it emits.
Good:
Batched transactions:
1. Swap 0.31397774711956405 WETH for 1000.0 USDC (nonce: 0)
2. Approve 500.0 USDC to Uniswap (nonce: 1)
3. Swap 500.0 USDC for 0.00716602 WBTC (nonce: 2)
Bad:
Batched transactions:
1. Swap 0.3144388268153504 WETH for 1000.0 USDC (nonce: 0)
2. Approve 34624816.583316 USDC to Uniswap (nonce: 1)
3. Swap 34624816.583316 USDC for 500.0 WBTC (nonce: 2)
4. Approve 500.0 USDC to Uniswap (nonce: 3)
5. Swap 500.0 USDC for 0.00720219 WBTC (nonce: 4)
As you can see, it's inserting a strange approve and swap in steps 2 & 3; not sure why. I'd like to see this fixed before merging this PR.
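(For readers unfamiliar with the `benchmarks.py` command used above: conceptually it re-runs a single pytest test N times and reports the pass rate. A self-contained sketch of that idea, not the repo's actual script:)

```python
import subprocess
import sys

def benchmark(test_id: str, runs: int) -> None:
    """Run a single pytest test `runs` times and report how often it passes."""
    passes = 0
    for i in range(runs):
        result = subprocess.run(
            [sys.executable, "-m", "pytest", test_id, "-q"],
            capture_output=True,
        )
        passed = result.returncode == 0
        passes += passed
        print(f"run {i + 1}/{runs}: {'pass' if passed else 'fail'}")
    print(f"passed {passes}/{runs}")

if __name__ == "__main__":
    # e.g. python benchmarks.py ./autotx/tests/token/test_swap.py::test_auto_tx_swap_multiple 10
    benchmark(sys.argv[1], int(sys.argv[2]))
```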
It seems that I can't run your unit test here. When I use …
@FromCSUZhou hey ser, have you tried to run …?
yes, I have tried this, please refer to the full fail log:

```
autotx/tests/token/test_swap.py 85140439500af06769197c19f70dcb3e8a49205e35bf3f3986933a1073bacef9
==================================== ERRORS ====================================
autotx/tests/conftest.py:33:
E   SystemExit: Can not connect with local node. Did you run …
autotx/utils/configuration.py:21: SystemExit
../../../../opt/anaconda3/envs/AutoTx/lib/python3.10/site-packages/pkg_resources/__init__.py:2832
../../../../opt/anaconda3/envs/AutoTx/lib/python3.10/site-packages/pkg_resources/__init__.py:2832
../../../../opt/anaconda3/envs/AutoTx/lib/python3.10/site-packages/pkg_resources/__init__.py:2317
../../../../opt/anaconda3/envs/AutoTx/lib/python3.10/enum.py:289
../../../../opt/anaconda3/envs/AutoTx/lib/python3.10/site-packages/pydantic_core/core_schema.py:3979
../../../../opt/anaconda3/envs/AutoTx/lib/python3.10/site-packages/pydantic/_internal/_generate_schema.py:628: 19 warnings
../../../../opt/anaconda3/envs/AutoTx/lib/python3.10/site-packages/pydantic/_internal/_config.py:272
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
```
okay yah, this is a known issue @FromCSUZhou, which is being tracked here: #105. in the meantime, would it be possible for you to run things manually, doing …?
Also @FromCSUZhou please note that, when running tests, you do not need to start the fork yourself. The test runner will start a fresh chain fork before running each test. Be sure to have Docker running as well :)
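(A hedged sketch of what such a per-test fork fixture typically looks like, assuming Foundry's `anvil` is on PATH and provides the fork; this is not the repo's actual conftest.py, and the fork URL is a placeholder:)

```python
import subprocess
import pytest

@pytest.fixture(autouse=True)
def fresh_fork():
    """Start a fresh local chain fork before each test, tear it down after."""
    # Placeholder RPC URL; the real runner wires in its own fork source.
    node = subprocess.Popen(["anvil", "--fork-url", "https://rpc.example.com"])
    try:
        yield node
    finally:
        node.terminate()
        node.wait()
```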
follow cbrzn suggestion Co-authored-by: Cesar Brazon <[email protected]>
Yes, everything is working fine except the unit tests
/workflows/benchmarks agents/token
Test Run Summary
Detailed Results: (results table collapsed in the original report)
Total run time: 64.46 minutes
I think this PR is no longer needed as we've incorporated multi-shot prompting in another PR
Some prompt engineering to improve output quality and stability, especially for small models that are not as powerful as GPT-4.