chore: prompt engineering to enhance output stability #119

Conversation

nickcom007

Some prompt engineering to improve the output quality and stability, especially for small models that are not as powerful as GPT-4

@cbrzn cbrzn left a comment

thanks for pushing this 😄 left some thoughts

{{
"type": "unsupported",
"message": "Reason why the prompt is unsupported here"
}}

Please note: You can only output content in json format, and do not output any other content!
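
For context, the doubled braces suggest this snippet comes from a Python format string, so the model ultimately sees a plain JSON object. Below is a minimal sketch of how a caller might defensively enforce that JSON-only contract; `parse_agent_response` is a hypothetical helper, not AutoTx's actual API.

```python
import json

def parse_agent_response(raw: str) -> dict:
    """Parse a model reply that is supposed to be a bare JSON object.

    Small models often wrap the JSON in prose or code fences despite the
    instruction, so extract the outermost {...} span before parsing and
    fall back to the "unsupported" shape on failure.
    """
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return {"type": "unsupported", "message": "Model did not return JSON"}
    try:
        return json.loads(raw[start : end + 1])
    except json.JSONDecodeError as exc:
        return {"type": "unsupported", "message": f"Invalid JSON: {exc}"}
```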

this line is repeated in line 121 - is that intended?


Yes, we have found that for small-parameter models, emphasizing the most important content at both the beginning and the end of a long prompt seems to help the stability of the output.
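
To make that layout concrete, here is a minimal sketch of the "sandwich" structure, with hypothetical section names; the point is only that the hard constraint appears both before and after the long middle of the prompt.

```python
def build_prompt(constraint: str, instructions: str, examples: str) -> str:
    # State the hard constraint first and repeat it last: small models
    # attend most reliably to the start and end of a long context.
    return "\n\n".join([constraint, instructions, examples, constraint])

prompt = build_prompt(
    constraint="You can only output content in JSON format; do not output anything else.",
    instructions="Plan the token transactions needed to achieve the user's goal...",
    examples="Example 1: ...",
)
```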


Or could we use different prompts for strong models such as GPT-4 versus small-parameter models that need the requirements emphasized repeatedly? Long prompts may not be necessary for GPT-4, which already handles this well.

@cbrzn cbrzn Apr 3, 2024

> Or could we use different prompts for strong models such as GPT-4 versus small-parameter models that need the requirements emphasized repeatedly? Long prompts may not be necessary for GPT-4, which already handles this well.

i think it makes sense to have one prompt for gpt-4 and another for small parameter models

based on the benchmark results from #119 (review) i would recommend updating the examples with the changes i suggested and seeing if they improve the planning (which they should, since the examples rn are incorrect). if that does not improve things, then use different prompts for the different models 😄 - otherwise, i don't think it is necessary to have different prompts in this PR


Yes, for right now I think we should have a single prompt. Adding more will just add another dimension of complexity, which will make it harder to get consistent results.


Yes, I agree with you. Given that GPT-4 can already handle the requirements with concise prompts, more examples may only consume tokens. I think two different sets of prompts could be submitted, adapted to GPT-4 and to our local LLMs.
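
If the two-prompt route were ever taken, the selection itself would be cheap; a sketch with hypothetical names (this PR ultimately kept a single prompt, per the discussion above):

```python
CONCISE_PROMPT = "..."  # short prompt: GPT-4 follows it reliably without repetition
VERBOSE_PROMPT = "..."  # long prompt: constraints repeated fore and aft for small models

def select_prompt(model_name: str) -> str:
    # Strong models get the cheap, concise prompt; local LLMs get the
    # token-heavy version with repeated constraints and extra examples.
    return CONCISE_PROMPT if model_name.startswith("gpt-4") else VERBOSE_PROMPT
```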

@cbrzn cbrzn requested a review from nerfZael April 2, 2024 05:59
@dOrgJelli dOrgJelli left a comment


After benchmarking this locally via `python benchmarks.py ./autotx/tests/token/test_swap.py::test_auto_tx_swap_multiple 10`, I found that it performed worse than main does currently (NOTE: i pushed a fix for this test to main, need to merge into this branch).

While main passed 8/10 times (its failures mainly relating to this bug: #112), this branch passes 6/10 times, with the primary problem being incorrect planning. To illustrate this I'll paste a good plan and one of the bad plans it emits.

Good:

Batched transactions:
1. Swap 0.31397774711956405 WETH for 1000.0 USDC (nonce: 0)
2. Approve 500.0 USDC to Uniswap (nonce: 1)
3. Swap 500.0 USDC for 0.00716602 WBTC (nonce: 2)

Bad:

Batched transactions:
1. Swap 0.3144388268153504 WETH for 1000.0 USDC (nonce: 0)
2. Approve 34624816.583316 USDC to Uniswap (nonce: 1)
3. Swap 34624816.583316 USDC for 500.0 WBTC (nonce: 2)
4. Approve 500.0 USDC to Uniswap (nonce: 3)
5. Swap 500.0 USDC for 0.00720219 WBTC (nonce: 4)

As you can see it's inserting a strange swap in steps 2 & 3, not sure why. I'd like to see this fixed before merging this PR.
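
For what it's worth, the tell in the bad plan is that steps 2 & 3 spend ~34.6M USDC when the plan has only acquired 1000.0 USDC. A simple balance-tracking sanity check would catch that; the sketch below is purely illustrative and not something this PR or AutoTx implements (approvals are ignored since they don't move funds).

```python
def check_plan_balances(
    swaps: list[tuple[float, str, float, str]],  # (sell_amt, sell_token, buy_amt, buy_token)
    starting: dict[str, float],
) -> list[str]:
    """Flag swaps that spend more of a token than the plan has acquired."""
    balances = dict(starting)
    issues = []
    for i, (sell_amt, sell_tok, buy_amt, buy_tok) in enumerate(swaps, start=1):
        if sell_amt > balances.get(sell_tok, 0.0):
            issues.append(f"swap {i} spends {sell_amt} {sell_tok}, "
                          f"but only {balances.get(sell_tok, 0.0)} is available")
        balances[sell_tok] = balances.get(sell_tok, 0.0) - sell_amt
        balances[buy_tok] = balances.get(buy_tok, 0.0) + buy_amt
    return issues

bad = [(0.3144388268153504, "WETH", 1000.0, "USDC"),
       (34624816.583316, "USDC", 500.0, "WBTC"),
       (500.0, "USDC", 0.00720219, "WBTC")]
print(check_plan_balances(bad, {"WETH": 1.0}))  # flags swaps 2 and 3
```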

@FromCSUZhou

> After benchmarking this locally via `python benchmarks.py ./autotx/tests/token/test_swap.py::test_auto_tx_swap_multiple 10`, I found that it performed worse than main does currently (NOTE: i pushed a fix for this test to main, need to merge into this branch).
>
> While main passed 8/10 times (its failures mainly relating to this bug: #112), this branch passes 6/10 times, with the primary problem being incorrect planning. To illustrate this I'll paste a good plan and one of the bad plans it emits.
>
> Good:
>
> Batched transactions:
> 1. Swap 0.31397774711956405 WETH for 1000.0 USDC (nonce: 0)
> 2. Approve 500.0 USDC to Uniswap (nonce: 1)
> 3. Swap 500.0 USDC for 0.00716602 WBTC (nonce: 2)
>
> Bad:
>
> Batched transactions:
> 1. Swap 0.3144388268153504 WETH for 1000.0 USDC (nonce: 0)
> 2. Approve 34624816.583316 USDC to Uniswap (nonce: 1)
> 3. Swap 34624816.583316 USDC for 500.0 WBTC (nonce: 2)
> 4. Approve 500.0 USDC to Uniswap (nonce: 3)
> 5. Swap 500.0 USDC for 0.00720219 WBTC (nonce: 4)
>
> As you can see it's inserting a strange swap in steps 2 & 3, not sure why. I'd like to see this fixed before merging this PR.

It seems that I can't run your unit test here. When I use `poetry run start-fork` alone, the unit test reports the error `docker: Error response from daemon: Conflict. The container name "/autotx_chain_fork" is already in use`; when I remove the docker container and directly use `python benchmarks.py ./autotx/tests/token/test_swap.py::test_auto_tx_swap_multiple 10`, it reports the error `Can not connect with local node. Did you run poetry run start-fork?`. Is it a problem with my setup?

@cbrzn

cbrzn commented Apr 2, 2024

> It seems that I can't run your unit test here. When I use `poetry run start-fork` alone, the unit test reports the error `docker: Error response from daemon: Conflict. The container name "/autotx_chain_fork" is already in use`; when I remove the docker container and directly use `python benchmarks.py ./autotx/tests/token/test_swap.py::test_auto_tx_swap_multiple 10`, it reports the error `Can not connect with local node. Did you run poetry run start-fork?`. Is it a problem with my setup?

@FromCSUZhou hey ser, have you tried running `poetry run stop-fork` and then retrying the unit test?

@FromCSUZhou

> It seems that I can't run your unit test here. When I use `poetry run start-fork` alone, the unit test reports the error `docker: Error response from daemon: Conflict. The container name "/autotx_chain_fork" is already in use`; when I remove the docker container and directly use `python benchmarks.py ./autotx/tests/token/test_swap.py::test_auto_tx_swap_multiple 10`, it reports the error `Can not connect with local node. Did you run poetry run start-fork?`. Is it a problem with my setup?
>
> @FromCSUZhou hey ser, have you tried running `poetry run stop-fork` and then retrying the unit test?

yes, I have tried this; please refer to the full failure log:
============================= test session starts ==============================
platform darwin -- Python 3.10.14, pytest-8.1.1, pluggy-1.4.0
rootdir: /Users/tuozhou/Desktop/Work/Flock/AutoTx
configfile: pyproject.toml
plugins: anyio-4.3.0, vcr-1.0.2, web3-6.15.1
collected 1 item

autotx/tests/token/test_swap.py 85140439500af06769197c19f70dcb3e8a49205e35bf3f3986933a1073bacef9
Eautotx_chain_fork

==================================== ERRORS ====================================
_________________ ERROR at setup of test_auto_tx_swap_multiple _________________

@pytest.fixture()
def configuration():
  (_, agent, client) = get_configuration()

autotx/tests/conftest.py:33:


def get_configuration():
    w3 = Web3(HTTPProvider(FORK_RPC_URL))
    for i in range(10):
        if w3.is_connected():
            break
        if i == 9:
          sys.exit("Can not connect with local node. Did you run `poetry run start-fork`?")

E SystemExit: Can not connect with local node. Did you run poetry run start-fork?

autotx/utils/configuration.py:21: SystemExit
=============================== warnings summary ===============================
../../../../opt/anaconda3/envs/AutoTx/lib/python3.10/site-packages/crewai/telemetry/telemetry.py:6
/Users/tuozhou/opt/anaconda3/envs/AutoTx/lib/python3.10/site-packages/crewai/telemetry/telemetry.py:6: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
import pkg_resources

../../../../opt/anaconda3/envs/AutoTx/lib/python3.10/site-packages/pkg_resources/__init__.py:2832
/Users/tuozhou/opt/anaconda3/envs/AutoTx/lib/python3.10/site-packages/pkg_resources/__init__.py:2832: DeprecationWarning: Deprecated call to pkg_resources.declare_namespace('google').
Implementing implicit namespace packages (as specified in PEP 420) is preferred to pkg_resources.declare_namespace. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(pkg)

../../../../opt/anaconda3/envs/AutoTx/lib/python3.10/site-packages/pkg_resources/__init__.py:2832
/Users/tuozhou/opt/anaconda3/envs/AutoTx/lib/python3.10/site-packages/pkg_resources/__init__.py:2832: DeprecationWarning: Deprecated call to pkg_resources.declare_namespace('google.cloud').
Implementing implicit namespace packages (as specified in PEP 420) is preferred to pkg_resources.declare_namespace. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(pkg)

../../../../opt/anaconda3/envs/AutoTx/lib/python3.10/site-packages/pkg_resources/__init__.py:2317
/Users/tuozhou/opt/anaconda3/envs/AutoTx/lib/python3.10/site-packages/pkg_resources/__init__.py:2317: DeprecationWarning: Deprecated call to pkg_resources.declare_namespace('google').
Implementing implicit namespace packages (as specified in PEP 420) is preferred to pkg_resources.declare_namespace. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(parent)

../../../../opt/anaconda3/envs/AutoTx/lib/python3.10/enum.py:289
/Users/tuozhou/opt/anaconda3/envs/AutoTx/lib/python3.10/enum.py:289: DeprecationWarning: FUNCTIONS is deprecated and will be removed in future versions
enum_member = new_(enum_class, *args)

../../../../opt/anaconda3/envs/AutoTx/lib/python3.10/site-packages/pydantic_core/core_schema.py:3979
../../../../opt/anaconda3/envs/AutoTx/lib/python3.10/site-packages/pydantic_core/core_schema.py:3979
/Users/tuozhou/opt/anaconda3/envs/AutoTx/lib/python3.10/site-packages/pydantic_core/core_schema.py:3979: DeprecationWarning: FieldValidationInfo is deprecated, use ValidationInfo instead.
warnings.warn(msg, DeprecationWarning, stacklevel=1)

../../../../opt/anaconda3/envs/AutoTx/lib/python3.10/site-packages/pydantic/_internal/_generate_schema.py:628: 19 warnings
/Users/tuozhou/opt/anaconda3/envs/AutoTx/lib/python3.10/site-packages/pydantic/_internal/_generate_schema.py:628: PydanticDeprecatedSince20: __get_validators__ is deprecated and will be removed, use __get_pydantic_core_schema__ instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
warn(

../../../../opt/anaconda3/envs/AutoTx/lib/python3.10/site-packages/pydantic/_internal/_config.py:272
/Users/tuozhou/opt/anaconda3/envs/AutoTx/lib/python3.10/site-packages/pydantic/_internal/_config.py:272: PydanticDeprecatedSince20: Support for class-based config is deprecated, use ConfigDict instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
warnings.warn(DEPRECATION_MESSAGE, DeprecationWarning)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
ERROR autotx/tests/token/test_swap.py::test_auto_tx_swap_multiple - SystemExi...
======================== 27 warnings, 1 error in 9.44s =========================

@cbrzn

cbrzn commented Apr 2, 2024

okay yah this is a known issue @FromCSUZhou which is being tracked here #105

in the meantime, would it be possible for you to run things manually? doing `poetry run start-fork` and then `poetry run ask --prompt="Buy 1000 USDC with ETH and then buy WBTC with 500 USDC"`

@dOrgJelli

dOrgJelli commented Apr 2, 2024

Also @FromCSUZhou please note that, when running tests, you do not need to start the fork yourself. The test runner will start a fresh chain fork before running each test. Be sure to have Docker running as well :)
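
For context, that per-test behavior is typically implemented as an autouse pytest fixture. A minimal sketch of the shape, assuming the `poetry run start-fork` / `poetry run stop-fork` scripts mentioned in this thread return once the fork is up or down; the project's real fixture lives in autotx/tests/conftest.py and will differ.

```python
import subprocess

import pytest

@pytest.fixture(autouse=True)
def chain_fork():
    # Clear any stale fork container, then start a fresh one for this test.
    subprocess.run(["poetry", "run", "stop-fork"], check=False)
    subprocess.run(["poetry", "run", "start-fork"], check=True)
    try:
        yield
    finally:
        # Tear the fork down so the next test (or a manual run) starts clean.
        subprocess.run(["poetry", "run", "stop-fork"], check=True)
```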

FromCSUZhou and others added 4 commits April 4, 2024 20:53

- follow cbrzn suggestion (Co-authored-by: Cesar Brazon <[email protected]>)
- follow cbrzn suggestion (Co-authored-by: Cesar Brazon <[email protected]>)
- follow cbrzn suggestion (Co-authored-by: Cesar Brazon <[email protected]>)
- follow cbrzn suggestion (Co-authored-by: Cesar Brazon <[email protected]>)
@FromCSUZhou

> okay yah this is a known issue @FromCSUZhou which is being tracked here #105
>
> in the meantime, would it be possible for you to run things manually? doing `poetry run start-fork` and then `poetry run ask --prompt="Buy 1000 USDC with ETH and then buy WBTC with 500 USDC"`

Yes, everything is working fine except the unit tests

@cbrzn

cbrzn commented Apr 4, 2024

/workflows/benchmarks agents/token


github-actions bot commented Apr 4, 2024

Finished benchmarks

Test Run Summary

  • Run from: ./autotx/tests/agents/token
  • Iterations: 5
  • Total Success Rate: 88.00%

Detailed Results

| Test Name | Success Rate | Passes | Fails | Avg Time |
| --- | --- | --- | --- | --- |
| autotx/tests/agents/token/send/test_send.py::test_auto_tx_send_eth | 80% | 4 | 1 | 41s |
| autotx/tests/agents/token/send/test_send.py::test_auto_tx_send_erc20 | 80% | 4 | 1 | 42s |
| autotx/tests/agents/token/send/test_send.py::test_auto_tx_send_eth_sequential | 100% | 5 | 0 | 1.36m |
| autotx/tests/agents/token/send/test_send.py::test_auto_tx_send_erc20_parallel | 100% | 5 | 0 | 1.05m |
| autotx/tests/agents/token/send/test_send_with_tasks.py::test_send_tokens_agent | 100% | 5 | 0 | 25s |
| autotx/tests/agents/token/send/test_send_with_tasks.py::test_send_tokens_agent_with_check_eth | 80% | 4 | 1 | 29s |
| autotx/tests/agents/token/send/test_send_with_tasks.py::test_send_tokens_agent_with_check_erc20 | 100% | 5 | 0 | 29s |
| autotx/tests/agents/token/test_swap.py::test_auto_tx_swap_with_non_default_token | 80% | 4 | 1 | 54s |
| autotx/tests/agents/token/test_swap.py::test_auto_tx_swap_eth | 80% | 4 | 1 | 51s |
| autotx/tests/agents/token/test_swap.py::test_auto_tx_swap_multiple | 80% | 4 | 1 | 58s |
| autotx/tests/agents/token/test_swap_and_send.py::test_auto_tx_swap_and_send_simple | 40% | 2 | 3 | 57s |
| autotx/tests/agents/token/test_swap_and_send.py::test_auto_tx_swap_and_send_complex | 80% | 4 | 1 | 1.05m |
| autotx/tests/agents/token/test_token_research.py::test_price_change_information | 100% | 5 | 0 | 15s |
| autotx/tests/agents/token/test_token_research.py::test_token_general_information | 100% | 5 | 0 | 30s |
| autotx/tests/agents/token/test_token_research.py::test_get_token_exchanges | 100% | 5 | 0 | 15s |
| autotx/tests/agents/token/test_token_research.py::test_check_liquidity | 100% | 5 | 0 | 14s |
| autotx/tests/agents/token/test_token_research.py::test_get_top_5_tokens_from_base | 100% | 5 | 0 | 29s |
| autotx/tests/agents/token/test_token_research.py::test_get_top_5_most_traded_tokens_from_l1 | 100% | 5 | 0 | 24s |
| autotx/tests/agents/token/test_token_research.py::test_get_top_5_memecoins | 100% | 5 | 0 | 28s |
| autotx/tests/agents/token/test_token_research.py::test_get_top_5_memecoins_in_optimism | 60% | 3 | 2 | 26s |

Total run time: 64.46 minutes
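
The headline figure is just passes over total runs: 20 tests × 5 iterations = 100 runs with 12 failures, i.e. 88/100. A quick check against the table:

```python
fails = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 3, 1, 0, 0, 0, 0, 0, 0, 0, 2]  # per-test fail counts from the table
runs = 5 * len(fails)                        # 5 iterations x 20 tests = 100 runs
print(f"{(runs - sum(fails)) / runs:.2%}")   # -> 88.00%
```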

@nerfZael nerfZael self-requested a review April 4, 2024 18:47
@dOrgJelli

I think this PR is no longer needed as we've incorporated multi-shot prompting in another PR

@dOrgJelli dOrgJelli closed this Apr 18, 2024