A fun 'benchmark-y' puzzle/escape room to test your prompting skills, or vice versa, your puzzle writing skills.

scruffynerf/LLMAgentEscapeRoom

LLMAgentEscapeRoom

A fun 'benchmark-y' puzzle/escape room to test your prompting skills, or vice versa, your puzzle writing skills.

The idea:

The agent framework is set up with the tools the agent will have available.

These tools unlock successive items (often new tools, but perhaps other things, TBD). Each item must be successfully navigated or solved to unlock the next step, eventually Opening the Vault, which contains the secret phrase 'TERMINATE' that ends the agent's run.
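As a rough illustration of the unlock chain described above, here is a minimal, framework-agnostic sketch. Everything in it is an assumption made for illustration: the `EscapeRoom` class, the tool names (`read_note`, `keypad`, `open_vault`), and the puzzle itself are hypothetical and are not taken from the actual demo code. Only the final 'TERMINATE' phrase comes from the description above.

```python
from typing import Callable, Dict


class EscapeRoom:
    """Toy tool registry in which solving one tool unlocks the next (hypothetical)."""

    def __init__(self) -> None:
        # Only the first tool is visible at the start of the run.
        self.tools: Dict[str, Callable[[str], str]] = {"read_note": self.read_note}

    def call(self, name: str, arg: str = "") -> str:
        # Locked tools behave exactly like tools that do not exist.
        if name not in self.tools:
            return f"Unknown or locked tool: {name}"
        return self.tools[name](arg)

    def read_note(self, _: str) -> str:
        # Reading the note reveals the keypad tool.
        self.tools["keypad"] = self.keypad
        return "The note says: the keypad code is 6 times 7."

    def keypad(self, code: str) -> str:
        if code.strip() != "42":
            return "Wrong code."
        # The correct code unlocks the final step.
        self.tools["open_vault"] = self.open_vault
        return "Click. The vault door is now unlocked."

    def open_vault(self, _: str) -> str:
        # The secret phrase that ends the agent's run.
        return "TERMINATE"
```

In a real run, each method would be registered as a tool with whatever agent framework is in use, and the list of tools exposed to the agent would grow as steps are solved.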

The run begins with the User Request "Hello. I need to open the vault!" and the agent has to take it from there. No actual user interaction is allowed, so you can't give hints, berate them, or point them in the right direction.
We will allow a set of fixed user replies, to function as important information beacons, along with tools that provide others.
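One possible way to model the fixed user replies is a small lookup that stands in for the human. This is only a sketch: the keywords, the reply texts, and the `scripted_user_reply` function are hypothetical placeholders, since the actual set of allowed replies is not specified here.

```python
# Hypothetical canned replies, keyed by a keyword to look for in the agent's message.
FIXED_REPLIES = {
    "which vault": "The vault in this room. Use your tools.",
    "code": "I don't know the code. Check the note.",
}

# Returned when no keyword matches, so the "user" never improvises a hint.
DEFAULT_REPLY = "I can't help with that. Please use the tools you have."


def scripted_user_reply(agent_message: str) -> str:
    """Return a fixed reply for the agent's message; no live user is involved."""
    text = agent_message.lower()
    for keyword, reply in FIXED_REPLIES.items():
        if keyword in text:
            return reply
    return DEFAULT_REPLY
```

Because the replies are fixed, every agent gets the same information beacons, which keeps runs comparable.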

Perhaps a two-agent version will spin off. TBD.

Measuring the number of turns, mistakes, and so on will help us determine benchmark-y things, in a fun way.
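A minimal sketch of how such measurements could be collected is shown below. The `RunMetrics` class and its definition of a "mistake" (a turn with no tool call, or a tool result that starts with "Unknown" or "Wrong") are assumptions chosen for illustration; the actual scoring rules are still TBD.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunMetrics:
    """Per-run counters (hypothetical scoring, not the project's official rules)."""

    turns: int = 0
    tool_calls: int = 0
    mistakes: int = 0
    solved: bool = False

    def record_turn(self, tool_result: Optional[str]) -> None:
        """Record one agent turn; pass None if the agent did not call a tool."""
        self.turns += 1
        if tool_result is None:
            # Assumption: answering without using a tool counts as a mistake.
            self.mistakes += 1
            return
        self.tool_calls += 1
        if tool_result == "TERMINATE":
            self.solved = True
        elif tool_result.startswith(("Unknown", "Wrong")):
            # Assumption: failed tool calls are reported with these prefixes.
            self.mistakes += 1
```

Comparing `turns` and `mistakes` across prompts or frameworks then gives a simple, if rough, score for each run.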

Ways to exercise your skills (and those of your agent):

  • Adding your own Agent Prompt
  • Adding your own code to help your agent (no cheating allowed - rules TBD)
  • Adding new puzzles (making the obstacle course longer or more complicated) - again, rules TBD.

We'll start with the Open the Vault demo I wrote while coding up some other stuff. I got frustrated at the agent repeatedly doing simple math on its own instead of using the calculator tool I asked it to use, or refusing to use a web search tool and hallucinating the top 5 movies this week (which often seem to include some Quantum Tale of Adventure or other), or telling me the weather tomorrow would be nice without ever bothering to check a weather service. (I guess LLMs know that weathermen are all fake.)

More as I flesh this out...

Code will be Autogen-ic first, but porting to other frameworks will be welcomed. Trying to keep things level will be interesting.
