WorkBench

WorkBench is the first open-source benchmark for evaluating agent performance on realistic workplace tasks. Created by MindsDB. Special thanks to Jorge Torres, Adam Carrigan, and the rest of the MindsDB team for their support. The paper is available at https://arxiv.org/abs/2405.00823

Installation

Python Version: 3.10.11

git clone https://github.com/olly-styles/WorkBench.git
cd WorkBench
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Usage

All five sandbox databases, task-outcome pairs, and pre-computed inference results are provided in the data directory.

Evaluation

Because all pre-computed inference results are provided, you can reproduce the evaluation results in the paper without running inference. Run the following commands to calculate the metrics, without and with the --all_tools flag.

python scripts/evals/calculate_all_metrics.py;
python scripts/evals/calculate_all_metrics.py --all_tools;

Note that results are not provided for the all_tools variant of GPT-3.5 and Llama2-70B, as the prompt does not fit into the context window of these models.

Data generation

All generated data is provided pre-computed in the data directory. If you want to generate the data yourself, follow the steps below.

python scripts/data_generation/mocked_data/generate_all_mocked_data.py;
python scripts/data_generation/query_answer_generation/generate_all_query_and_answer.py;

Inference

Pre-computed inference results are provided in the data directory. If you want to run inference yourself, you will need to provide your own API keys.

An OpenAI key is required for GPT-3.5 and GPT-4.
An Anthropic key is required for Claude-2.
An Anyscale key is required for Llama2-70B and Mixtral-8x7B.

touch openai_key.txt && echo YOUR_OPENAI_API_KEY > openai_key.txt
touch anthropic_key.txt && echo YOUR_ANTHROPIC_API_KEY > anthropic_key.txt
touch anyscale_key.txt && echo YOUR_ANYSCALE_API_KEY > anyscale_key.txt
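
A minimal sketch of how a script might load one of these key files, assuming each file sits in the directory the script is run from and holds the key on a single line. The helper name `read_api_key` is hypothetical and is not taken from the WorkBench code.

```python
from pathlib import Path


def read_api_key(path: str) -> str:
    """Return the key stored in a one-line text file (hypothetical helper)."""
    # Strip the trailing newline that `echo` appends to the file.
    key = Path(path).read_text(encoding="utf-8").strip()
    # Reject an empty file or an unreplaced placeholder such as YOUR_OPENAI_API_KEY.
    if not key or key.startswith("YOUR_"):
        raise ValueError(f"{path} does not contain a real API key")
    return key
```

Calling `read_api_key("openai_key.txt")` before starting a long inference run can catch a missing or placeholder key early.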

Run inference for a specific domain and model

python scripts/inference/generate_results.py --model_name MODEL_NAME --queries_path QUERIES_PATH
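
To run a chosen subset of models and query files rather than everything, the command above can be wrapped in a small driver. The sketch below only uses the `--model_name` and `--queries_path` flags shown above. The helper names `build_commands` and `run_all` are hypothetical, and the model names and query paths you pass in must be the values the script actually accepts, which are not listed here.

```python
import itertools
import subprocess
import sys

SCRIPT = "scripts/inference/generate_results.py"


def build_commands(model_names, queries_paths):
    """Build one argv list per (model, queries file) pair."""
    return [
        [sys.executable, SCRIPT, "--model_name", model, "--queries_path", path]
        for model, path in itertools.product(model_names, queries_paths)
    ]


def run_all(commands):
    """Run each command in turn, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Run it from the repository root, for example `run_all(build_commands(["MODEL_NAME"], ["QUERIES_PATH"]))`, with the placeholders replaced by real values.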

Run inference for all domains and models

python scripts/inference/generate_all_results.py

Run inference for a new agent

How you get results for a new agent depends on how different it is from the existing agents. Here we go through three possible scenarios:

If the new agent is the same as an existing agent but with a different prompt, you can modify the prompt directly.
If the new agent uses an LLM that is supported by LangChain but not by the current implementation, the new LLM can be added to the supported LLMs.
To implement a new agent outside of the LangChain framework, update the inference loop.

FAQ

What are "queries" and "answers"?

We originally called tasks "queries" and outcomes "answers". We updated the terminology in the paper but have not yet updated the code.

What is "mocked data"?

Similar to the "queries" and "answers" terminology, we originally called the sandbox databases "mocked data". We updated the terminology in the paper but have not yet updated the code.

Can I contact the authors?

Yes! The fastest way to reach us is by opening an issue on this repository. If you want to reach out for any other reason, please send an email to [email protected]

Where's the paper?

https://arxiv.org/abs/2405.00823
