chore: prompt engineering to enhance output stability #119
Conversation
thanks for pushing this 😄 left some thoughts
```
{{
  "type": "unsupported",
  "message": "Reason why the prompt is unsupported here"
}}

Please note: You can only output content in json format, and do not output any other content!
```
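(For context on the snippet above, here is a minimal sketch of how a consumer might enforce that JSON-only contract when reading the model's reply; the function name and the brace-extraction fallback are illustrative assumptions, not code from this PR:)

```python
import json
from typing import Any

def parse_model_output(raw: str) -> dict[str, Any]:
    """Parse the model's reply, tolerating stray prose around the JSON.

    The prompt demands JSON only, but small models sometimes wrap the
    object in extra text, so we fall back to slicing out the outermost
    braces before giving up.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        start, end = raw.find("{"), raw.rfind("}")
        if start == -1 or end == -1:
            raise ValueError(f"No JSON object found in model output: {raw!r}")
        return json.loads(raw[start : end + 1])
```

A caller would then branch on `result.get("type") == "unsupported"` to surface the message to the user.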
this line is repeated in line 121 - is that intended?
Yes, we have found that for small parameter models, emphasizing the most important content at both the beginning and the end of a long prompt seems to help the stability of the output.
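(A minimal sketch of the technique described above, assuming a simple string-template prompt builder; the names are illustrative, not the project's actual code:)

```python
def build_prompt(task: str, examples: str, json_instruction: str) -> str:
    """Sandwich the critical output-format rule around the long prompt body.

    Small models tend to lose instructions buried in the middle of long
    contexts, so the most important rule is stated first and restated last.
    """
    return "\n\n".join([
        json_instruction,   # stated up front
        task,
        examples,           # the long middle section
        json_instruction,   # restated at the end for emphasis
    ])
```

The duplication costs a few tokens, but for small models it reduces the chance that the format rule gets ignored.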
Alternatively, could we use different prompts for strong reasoning models such as GPT-4 versus small parameter models that need requirements constantly re-emphasized? It's true that a long prompt may not be necessary for GPT-4, which can already handle this well.
i think it makes sense to have one prompt for gpt-4 and another for small parameter models
based on the benchmark results from #119 (review), i would recommend updating the examples with the changes i suggested and seeing if they improve the planning (which they should, since the examples right now are incorrect). if that does not improve things, then use different prompts for the different models 😄 otherwise, i don't think it is necessary to have different prompts for the models in this PR
Yes, for right now I think we should have a single prompt. Adding more will just add another dimension of complexity, which will make it harder to get consistent results.
Yes, I agree with you. Since GPT-4 can already handle the requirements with concise prompts, more examples may only consume tokens. I think two different sets of prompts could be submitted to adapt to GPT-4 and our local LLMs.
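(If two prompt sets were adopted, the selection logic could be as small as the following sketch; the constants and model list here are hypothetical, and nothing like this was merged in this PR:)

```python
CONCISE_PROMPT = "..."  # short prompt for strong models like GPT-4
VERBOSE_PROMPT = "..."  # long prompt with repeated instructions for small models

# Hypothetical set of models considered strong enough for the concise prompt
STRONG_MODELS = {"gpt-4", "gpt-4-turbo"}

def select_system_prompt(model_name: str) -> str:
    """Pick the concise prompt for strong models, the verbose one otherwise."""
    return CONCISE_PROMPT if model_name in STRONG_MODELS else VERBOSE_PROMPT
```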
After benchmarking this locally via `python benchmarks.py ./autotx/tests/token/test_swap.py::test_auto_tx_swap_multiple 10`, I found that it performed worse than main does currently (NOTE: I pushed a fix for this test to main; it needs to be merged into this branch).
While main passed 8/10 times, with failures mainly relating to bug #112, this branch passes 6/10 times, with the primary problem being incorrect planning. To illustrate this, I'll paste a good plan and one of the bad plans it emits.
Good:
Batched transactions:
1. Swap 0.31397774711956405 WETH for 1000.0 USDC (nonce: 0)
2. Approve 500.0 USDC to Uniswap (nonce: 1)
3. Swap 500.0 USDC for 0.00716602 WBTC (nonce: 2)
Bad:
Batched transactions:
1. Swap 0.3144388268153504 WETH for 1000.0 USDC (nonce: 0)
2. Approve 34624816.583316 USDC to Uniswap (nonce: 1)
3. Swap 34624816.583316 USDC for 500.0 WBTC (nonce: 2)
4. Approve 500.0 USDC to Uniswap (nonce: 3)
5. Swap 500.0 USDC for 0.00720219 WBTC (nonce: 4)
As you can see, it's inserting a strange approve and swap in steps 2 & 3; not sure why. I'd like to see this fixed before merging this PR.
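(For readers unfamiliar with the `benchmarks.py` command used above: conceptually it re-runs a single pytest test N times and reports the pass rate. A self-contained sketch of that idea, not the repo's actual script:)

```python
import subprocess
import sys

def benchmark(test_id: str, runs: int) -> None:
    """Run a single pytest test `runs` times and report how often it passes."""
    passes = 0
    for i in range(runs):
        result = subprocess.run(
            [sys.executable, "-m", "pytest", test_id, "-q"],
            capture_output=True,
        )
        passed = result.returncode == 0
        passes += passed
        print(f"run {i + 1}/{runs}: {'pass' if passed else 'fail'}")
    print(f"passed {passes}/{runs}")

if __name__ == "__main__":
    # e.g. python benchmarks.py ./autotx/tests/token/test_swap.py::test_auto_tx_swap_multiple 10
    benchmark(sys.argv[1], int(sys.argv[2]))
```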
It seems that I can't run your unit test here. When I use …
@FromCSUZhou hey ser, have you tried to run …?
yes, I have tried this, please refer to the full fail log:

```
autotx/tests/token/test_swap.py 85140439500af06769197c19f70dcb3e8a49205e35bf3f3986933a1073bacef9
==================================== ERRORS ====================================
autotx/tests/conftest.py:33:
E   SystemExit: Can not connect with local node. Did you run …
autotx/utils/configuration.py:21: SystemExit
../../../../opt/anaconda3/envs/AutoTx/lib/python3.10/site-packages/pkg_resources/__init__.py:2832
../../../../opt/anaconda3/envs/AutoTx/lib/python3.10/site-packages/pkg_resources/__init__.py:2832
../../../../opt/anaconda3/envs/AutoTx/lib/python3.10/site-packages/pkg_resources/__init__.py:2317
../../../../opt/anaconda3/envs/AutoTx/lib/python3.10/enum.py:289
../../../../opt/anaconda3/envs/AutoTx/lib/python3.10/site-packages/pydantic_core/core_schema.py:3979
../../../../opt/anaconda3/envs/AutoTx/lib/python3.10/site-packages/pydantic/_internal/_generate_schema.py:628: 19 warnings
../../../../opt/anaconda3/envs/AutoTx/lib/python3.10/site-packages/pydantic/_internal/_config.py:272
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
```
okay yah, this is a known issue @FromCSUZhou, which is being tracked here: #105. in the meantime, would it be possible for you to run things manually, doing …?
Also @FromCSUZhou please note that, when running tests, you do not need to start the fork yourself. The test runner will start a fresh chain fork before running each test. Be sure to have Docker running as well :)
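(A hedged sketch of what such a per-test fork fixture typically looks like, assuming Foundry's `anvil` is on PATH and provides the fork; this is not the repo's actual conftest.py, and the fork URL is a placeholder:)

```python
import subprocess
import pytest

@pytest.fixture(autouse=True)
def fresh_fork():
    """Start a fresh local chain fork before each test, tear it down after."""
    # Placeholder RPC URL; the real runner wires in its own fork source.
    node = subprocess.Popen(["anvil", "--fork-url", "https://rpc.example.com"])
    try:
        yield node
    finally:
        node.terminate()
        node.wait()
```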
follow cbrzn suggestion Co-authored-by: Cesar Brazon <[email protected]>
Yes, everything is working fine except the unit tests
/workflows/benchmarks agents/token
Test Run Summary
Detailed Results: (results table collapsed in the original report)
Total run time: 64.46 minutes
I think this PR is no longer needed as we've incorporated multi-shot prompting in another PR
Some prompt engineering to improve output quality and stability, especially for small models that are not as powerful as GPT-4.