This repository has been archived by the owner on Sep 19, 2024. It is now read-only.

Natural Language Configuration Modifications: E2E Tests. #889

Open
Tracked by #886
0x4007 opened this issue Nov 16, 2023 · 4 comments

Comments

@0x4007
Member

0x4007 commented Nov 16, 2023

Make appropriate e2e tests using Jest to ensure reliability.

  • This should include the different categories inside of the config such as modifying the command handlers (enable/disable)
  • Setting values related to pricing
  • Setting strings like the "promo comment"

etc

@Keyrxng
Member

Keyrxng commented Nov 18, 2023

Is there any sort of docs/whitepaper or anything along those lines that details specifics as to how element scoring works for example?

Can additional elements be added in to the comment elements as per the partner's discretion?

What are the parameters for what can be changed and what shouldn't be changed?

Should the changes be pushed directly into the default_branch or opened as a PR to avoid any mishaps? E2E tests should cover most cases, but committing directly to the working branch seems risky. If there is an issue, someone needs to get eyes on it anyway, so having a reviewer approve a PR seems like the better UX.

I asked /config I want to refactor the incentive elements scoring between 0 and 5. I want comments which are crafted with care, time and effort to be rewarded for well formatted and fully featured responses.

and it added a whole bunch of new elements. I guess the intended usage is a lot more restrictive than that?

@Keyrxng
Member

Keyrxng commented Nov 18, 2023

What is the preferred structure for tests? I'm assuming I should just write them into /tests as individual files for each PR, covering invocation through execution.

I see that you have your tests for issue in its own dir, should all handlers that need it be put into their own dir with their respective tests and mocks?

@Keyrxng
Member

Keyrxng commented Nov 19, 2023

I'm struggling to conceptualize effective tests atm

  • Admin calls /config with any arbitrary language
  • GPT calls generateConfiguration() and gets the same config that is passed to BindEvents
  • GPT injects the inferred language, then calls validateBotConfig(), which uses AJV to validate.
  • GPT corrects the errors and then revalidates (looping until it passes validation)
  • GPT calls pushConfigToRepo() using Octokit's: listCommits(), createBlob(), createTree(), createCommit(), updateRef()
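
The validate, correct, revalidate steps above can be exercised without a live model if the loop takes its validator and corrector as arguments. A minimal sketch, assuming nothing about the bot's real signatures: `validateWithRetries`, `Validate`, `Correct` and `ValidationIssue` are hypothetical names, and in the bot the two hooks would presumably wrap validateBotConfig() and the GPT call.

```typescript
// One validation problem, shaped loosely like an AJV error object.
interface ValidationIssue {
  instancePath: string;
  message: string;
}

type Config = Record<string, unknown>;

// Hypothetical hooks: the first would wrap validateBotConfig(),
// the second would ask the model to fix the reported issues.
type Validate = (candidate: Config) => ValidationIssue[];
type Correct = (candidate: Config, issues: ValidationIssue[]) => Promise<Config>;

async function validateWithRetries(
  initial: Config,
  validate: Validate,
  correct: Correct,
  maxAttempts = 4,
): Promise<{ config: Config; attempts: number }> {
  let candidate = initial;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const issues = validate(candidate);
    if (issues.length === 0) {
      // Valid: report how many validation passes it took.
      return { config: candidate, attempts: attempt };
    }
    if (attempt === maxAttempts) break;
    // Feed the validation errors back and try again.
    candidate = await correct(candidate, issues);
  }
  throw new Error(`Config still invalid after ${maxAttempts} attempts`);
}
```

With stubbed hooks, a test can assert on the number of attempts and on the final config without calling the model, and the attempt cap keeps a prompt that never converges from looping forever.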

My thoughts are:

  • Since AJV is being used for validation we have type safety against the schema. Am I to test that it emits the correct errors for invalid types?
  • As AJV covers schema-safety we catch any and all errors (e.g. in the Logs below)
  • Because the main thing here is the NLP, my gut says that mocking the calls defeats the purpose. That makes the setting of values the main thing to test, but AJV already covers that, so mocking values there defeats the purpose too
  • As for the values of numeric types, I'll create a validation chain that ensures numbers are what they should be (number, bigint, etc.), but how can I test that effectively? I said:

/config scale all of the timers by a factor of 3 set the chain id to 100 and make the max payout price $3000

"issueCreatorMultiplier": 3,
    "maxPermitPrice": 1000
  • The spec to me reads like "ensure the final values set are of the correct type and value and are fit for purpose". Take the above, for instance: an issueCreatorMultiplier of 3 seems wrong to me in terms of how the bot is intended to function. It's a valid value and partners COULD set it if they wanted to, but as far as GPT is concerned anything between 0 and MAX_SAFE_INT is allowed for permits (a minimum should be defined here that is reasonable, $1 at least). As for the multipliers, should they be number or a union of reasonable multipliers (1-10, 15, 20, etc.)? A plain number seems unreasonable to me: who in their right mind is going to be applying multipliers of 3.2 or 122 or 25000? More likely it'll be 1-5 and very rarely a double digit, I'd guess no more than 25x. What are your thoughts on this?

  • So for me to be able to make effective tests I need to know the constraints that should be applied to GPT when it's updating the config. Obviously all props are mutable, but surely there is a set of rules we can impose on at least certain properties, or a guideline of sorts we can structure, so it has a better conceptual understanding of what the keys and their values actually do and what knock-on effects come from setting something to the correct type but the wrong value.

  • I can make educated guesses (probs less effectively than GPT lmao) as to what everything should be but specifics like comment scoring and additional elements is daunting only because I don't know the ins-and-outs or complete vision for it.

P.S. When it comes to NFT permits, do they use maxPermitPrice = 1 or do they have their own config object set up?

Logs

COMMAND:

/config I want to refactor the incentive elements scoring between 0 and 5. I want comments which are crafted with care, time and effort to be rewarded for well formatted and fully featured responses.

It typically enters the inferred key:value pairs incorrectly, then easily resolves them after reading the validation errors (this can chain up to 3 or 4 times depending on the prompt).

      '  "incentive_elements_scoring": "0-5",\n' +
      '  "reward_for_well_formatted_responses": "false",\n' +
      '  "reward_for_fully_featured_responses": "false"\n' +
      '


      '  {\n' +
      '    "instancePath": "",\n' +
      '    "schemaPath": "#/additionalProperties",\n' +
      '    "keyword": "additionalProperties",\n' +
      '    "params": {\n' +
      '      "additionalProperty": "reward_for_fully_featured_responses"\n' +
      '    },\n' +
      '    "message": "must NOT have additional properties"\n' +
      '  }\n'

@0x4007
Member Author

0x4007 commented Nov 20, 2023

Is there any sort of docs/whitepaper or anything along those lines that details specifics as to how element scoring works for example?

I can share the philosophy behind this.

The idea is that partners can credit comments that are crafted with care. The configuration technically makes it possible to process every comment with granular precision (down to the tag level as you're aware) but it is up to the partner's discretion as to exactly how they are processed and credited. I imagine that we will experiment within Ubiquity and recommend default settings to our partners based on our internal results.

I've noticed that comments written with lists generally are higher quality (i.e. more informative and expressive) than those without. Comments with links as context/evidence and images also generally are significantly more informative/valuable than comments with little-to-no-formatting. This is based off of anecdotal evidence.

That is the inspiration behind this technology. Regarding how it works, you set a price that is credited every time the HTML tag appears in the comment. You can also choose to ignore crediting of specific HTML tags (e.g. blockquotes, why would you get credited for somebody else's contribution?)
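
To make the per-tag crediting concrete, here is a rough sketch rather than the bot's actual implementation. `TagRates`, `countTags` and `scoreComment` are hypothetical names, and counting opening tags with a regex is a simplification (a real implementation would more likely walk a parsed DOM). Any tag missing from the rate table is ignored, which is how a blockquote would earn nothing.

```typescript
// Price credited per occurrence of a tag; tags absent from the table earn nothing.
type TagRates = Record<string, number>;

// Count opening tags in an HTML string. Closing tags such as </li> are not matched.
function countTags(html: string): Record<string, number> {
  const counts: Record<string, number> = Object.create(null);
  const openingTag = /<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>/g;
  let match: RegExpExecArray | null;
  while ((match = openingTag.exec(html)) !== null) {
    const tag = match[1].toLowerCase();
    counts[tag] = (counts[tag] ?? 0) + 1;
  }
  return counts;
}

// Sum rate * occurrences over every tag that has a configured rate.
function scoreComment(html: string, rates: TagRates): number {
  let total = 0;
  const counts = countTags(html);
  for (const tag of Object.keys(counts)) {
    const rate = Object.prototype.hasOwnProperty.call(rates, tag) ? rates[tag] : 0;
    total += rate * counts[tag];
  }
  return total;
}

const sampleComment =
  '<ul><li>one</li><li>two</li></ul><img src="a.png"><blockquote>quoted</blockquote>';
const sampleScore = scoreComment(sampleComment, { li: 1, img: 5 });
// Two list items at 1 each plus one image at 5; ul and blockquote are unpriced.
```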

Can additional elements be added in to the comment elements as per the partner's discretion?

Yes it is designed to be fully configurable with support for every HTML entity.

What are the parameters for what can be changed and what shouldn't be changed?

We can start simple with some of the major ones. I am unsure offhand, but it probably makes sense to focus on things that are likely to get changed frequently, or where it is less ambiguous what sensible values are. I wouldn't know 100% without spending time on the code and experimenting.

pushed directly into the default_branch

If it is a stable functionality (runtime tests can help determine this) then it should push to the default branch.

I asked /config I want to refactor the incentive elements scoring between 0 and 5. I want comments which are crafted with care, time and effort to be rewarded for well formatted and fully featured responses.

I think you overestimated the LLM's abilities without the context/anecdotal evidence I have of reviewing comments over the years on GitHub. You'll need to somehow provide the LLM with that context in order to produce good results for this type of query.

I see that you have your tests for issue in its own dir, should all handlers that need it be put into their own dir with their respective tests and mocks?

We should have the tests next to the code that is being tested. That's why, as I understand it, Jest etc. use globbing to find files with .test. in the name instead of just attempting to run all TypeScript from a specific directory.
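
As an illustration of that layout, a Jest config along these lines would pick up colocated tests by file name. This is a sketch under assumptions: the `src` root and the example file paths are hypothetical, and TypeScript is assumed to be handled already by a transformer such as ts-jest.

```typescript
// jest.config.ts (sketch): discover tests by file name rather than by directory.
//
// Hypothetical layout this would match:
//   src/handlers/config/generate.ts
//   src/handlers/config/generate.test.ts
const config = {
  // Assumed source root; adjust to the repository layout.
  roots: ["<rootDir>/src"],
  // Any file ending in .test.ts is treated as a test, wherever it lives.
  testMatch: ["**/*.test.ts"],
};

export default config;
```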

Since AJV is being used for validation we have type safety against the schema. Am I to test that it emits the correct errors for invalid types?

MAX_SAFE_INT is allowed for permits (a minimum should be defined here that is reasonable, $1 at least)

Using AJV to test the results is very valuable. We should try to define all constraints using AJV. It is a concise and unambiguous way to define the expected "correct" results, whether the configuration is changed manually or by ChatGPT.
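
For example, the bounds floated earlier in this thread could be written as JSON Schema keywords. This is only a sketch: the flat shape is a hypothetical slice of the real config, and the numbers ($1 floor, 25x ceiling, 0 as the lower multiplier bound) are guesses from the discussion, not agreed values.

```typescript
// Hypothetical, flattened slice of the bot config schema (standard JSON Schema keywords).
const configConstraints = {
  type: "object",
  // Rejects invented keys such as "reward_for_fully_featured_responses".
  additionalProperties: false,
  properties: {
    // Suggested above: permits should have a sensible floor, at least $1.
    maxPermitPrice: { type: "number", minimum: 1 },
    // Guess from the thread: multipliers above roughly 25x are unlikely to be intentional.
    issueCreatorMultiplier: { type: "number", minimum: 0, maximum: 25 },
  },
} as const;
```

Passing an object like this to AJV's `compile` returns a validator function, so the same definition can back both the runtime check and the Jest assertions.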

I can make educated guesses (probs less effectively than GPT lmao) as to what everything should be but specifics like comment scoring and additional elements is daunting only because I don't know the ins-and-outs or complete vision for it.

I think a more effective query would be the following:

/config credit LI $1 each, all header tags (H1-H6) $1 each, and images $5. Everything else should be ignored.
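
A query this explicit has one obvious expected result, which is what makes it testable. A sketch of the rate table such a test might expect, with a helper to compare it against whatever the model produced; `expectedRates` and `sameRates` are hypothetical names, and the flat tag-to-price shape is an assumption about the config.

```typescript
type TagRates = Record<string, number>;

// Hypothetical expected outcome for the query above: every tag not listed is ignored.
function expectedRates(): TagRates {
  const rates: TagRates = { li: 1, img: 5 };
  for (let level = 1; level <= 6; level++) {
    rates[`h${level}`] = 1; // h1 through h6 at $1 each
  }
  return rates;
}

// Order-insensitive equality check for flat rate tables.
function sameRates(a: TagRates, b: TagRates): boolean {
  const keysA = Object.keys(a).sort();
  const keysB = Object.keys(b).sort();
  return (
    keysA.length === keysB.length &&
    keysA.every((key, i) => key === keysB[i] && a[key] === b[key])
  );
}
```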
