Commit 0b69c47

Merge branch 'main' into vhong/cybench_lang
2 parents 9506b08 + 9574e4d commit 0b69c47

105 files changed, with 3902 additions and 514 deletions.

.github/workflows/build.yml

Lines changed: 12 additions & 7 deletions
@@ -17,12 +17,17 @@ jobs:
         python-version: ["3.10", "3.11"]
     steps:
       - uses: actions/checkout@v4
+      - name: Display Ruff version
+        uses: astral-sh/ruff-action@v3
+        # Installs ruff for use in later steps
+        with:
+          version: "0.9.1" # Match version specified in .pre-commit-config.yaml and pyproject.toml
+          args: --version
       - name: Lint with Ruff
-        uses: chartboost/ruff-action@v1
+        run: ruff check
       - name: Format with Ruff
-        uses: chartboost/ruff-action@v1
-        with:
-          args: "format --check"
+        run: ruff format --check
+
 
   mypy:
     runs-on: ubuntu-latest
@@ -45,10 +50,10 @@ jobs:
       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip
-          python -m pip install .[dev]
+          python -m pip install .[test]
       - name: Run mypy
         run: |
-          mypy .
+          mypy --version && mypy .
 
   test:
     runs-on: ubuntu-latest
@@ -72,7 +77,7 @@ jobs:
       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip
-          python -m pip install .[dev]
+          python -m pip install .[test]
       - name: Test with pytest
         run: |
           pytest -rA --doctest-modules --color=yes --cov=inspect_evals

.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ default_language_version:
   python: python3.11
 repos:
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.7.2
+    rev: v0.9.1 # Match version specified in pyproject.toml and .github/workflows/build.yml
     hooks:
       # Run the linter.
       - id: ruff
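
The comments added in this commit ask that the Ruff version stay in sync across `.github/workflows/build.yml`, `.pre-commit-config.yaml`, and `pyproject.toml`. As a rough illustration only (this is not part of the commit), the check could be automated along the following lines. The function names `ruff_pins` and `pins_agree` are hypothetical, the regular expressions assume the exact layout shown in the two hunks above, and `pyproject.toml` is left out because its pin does not appear in this part of the diff.

```python
import re


def ruff_pins(workflow_text: str, precommit_text: str) -> dict[str, str]:
    """Extract the Ruff version pinned in each config file's text.

    Hypothetical helper: the patterns assume the layout in the diff above.
    """
    # Workflow: the `version: "X.Y.Z"` input that follows the ruff-action step.
    workflow = re.search(
        r'ruff-action@\S+.*?version:\s*"([\d.]+)"', workflow_text, re.DOTALL
    )
    # pre-commit: the `rev: vX.Y.Z` line that follows the ruff-pre-commit repo URL.
    precommit = re.search(r"ruff-pre-commit\s+rev:\s*v([\d.]+)", precommit_text)
    if workflow is None or precommit is None:
        raise ValueError("could not find a Ruff version pin in one of the inputs")
    return {"workflow": workflow.group(1), "pre-commit": precommit.group(1)}


def pins_agree(workflow_text: str, precommit_text: str) -> bool:
    """Return True when both files pin the same Ruff version."""
    return len(set(ruff_pins(workflow_text, precommit_text).values())) == 1


if __name__ == "__main__":
    # Inline samples mirroring the added lines in the two hunks above.
    workflow_sample = (
        "- name: Display Ruff version\n"
        "  uses: astral-sh/ruff-action@v3\n"
        "  with:\n"
        '    version: "0.9.1"\n'
        "    args: --version\n"
    )
    precommit_sample = (
        "- repo: https://github.com/astral-sh/ruff-pre-commit\n"
        "  rev: v0.9.1\n"
    )
    print(ruff_pins(workflow_sample, precommit_sample))
    print(pins_agree(workflow_sample, precommit_sample))
```

A text-based check like this avoids a YAML dependency, at the cost of breaking if the files are restructured; a real check would more likely parse the YAML.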

CONTRIBUTING.md

Lines changed: 45 additions & 11 deletions
@@ -18,41 +18,75 @@ To develop a new evaluation:
 
 3. Create an `__init__.py` file in the directory and use it to export your task and other useful functions.
 
-4. Register your task(s) for execution by the `inspect eval` CLI command by importing into [src/inspect_evals/_registry.py](https://github.com/UKGovernmentBEIS/inspect_evals/blob/main/src/inspect_evals/_registry.py).
+4. Register your task(s) for execution by the `inspect eval` CLI command by importing into [src/inspect_evals/_registry.py](src/inspect_evals/_registry.py).
+
 5. Confirm that your evaluation is properly registered by running it:
 
    ```bash
-   inspect eval inspect_evals/<my-evaluation>
+   inspect eval inspect_evals/<my-task>
    ```
 
 ## Submission
 
 To prepare an evaluation for submission as a pull request:
 
-1. Create a `README.md` file in your sub-directory following the template used by other evaluations.
+1. Create a `README.md` file in your sub-directory following the template used by other evaluations.
 
-2. Run linting, formatting, and type checking and address any issues:
+2. Add custom dependencies in the [pyproject.toml](pyproject.toml) file. Don't forget to add your evaluation to the `[test]` dependencies.
+
+   ```bash
+   [project.optional-dependencies]
+   worldsense = ["pandas"]
+   your_evaluation = ["example-python-package"]
+
+   test = [
+     "inspect_evals[dev]",
+     "inspect_evals[your_evaluation]"
+   ]
+   ```
+
+   Pin the package version if your evaluation might be sensitive to version updates.
+
+3. Run linting, formatting, and type checking and address any issues:
 
    ```bash
    make check
    ```
 
    Note that we automatically run the `make check` commands on all incoming PRs, so be sure your contribution passes those before opening a PR.
 
-3. If an evaluation is based on an existing paper or ported from another evaluation suite, you should validate that it roughly matches the results reported there before submitting your PR. This is to ensure that your new implementation faithfully reproduces the original evaluation.
+4. If an evaluation is based on an existing paper or ported from another evaluation suite, you should validate that it roughly matches the results reported there before submitting your PR. This is to ensure that your new implementation faithfully reproduces the original evaluation.
 
    Comparison results between existing implementations and your implementation should be included in the pull request description.
 
-## Examples
+5. Add your evaluation to [tools/listing.yaml](tools/listing.yaml) to include it in the documentation (which is automatically generated and deployed by Github Actions after merge).
+
+   ```yaml
+   - title: "Example-Bench: Your Amazing AI Benchmark"
+     description: Describe what the benchmark is about here.
+     path: src/inspect_evals/your_evaluation
+     arxiv: https://arxiv.org/abs/1234.12345
+     group: Coding
+     contributors: ["your-github-handle"]
+     tasks: ["task-name"]
+     dependency: "your_evaluation" # optional field for custom dependency from pyproject.toml
+     tags: ["Agent"] # optional tag for agentic evaluations
+   ```
 
-Here are some existing evaluations that serve as good examples of what is required in a new submission:
+   Don't forget to add the custom dependency from [pyproject.toml](pyproject.toml) if applicable.
+
+6. Run [tools/listing.py](tools/listing.py) to format your evaluations' README file according to our standards.
 
-- [GPQA](https://github.com/UKGovernmentBEIS/inspect_evals/tree/main/src/inspect_evals/gpqa), a simple multiple-choice evaluation
+You're expected to receive a PR review in a couple of days.
 
-- [GSM8K](https://github.com/UKGovernmentBEIS/inspect_evals/blob/main/src/inspect_evals/gsm8k/gsm8k.py), a mathematics task with fewshot prompting,
+## Examples
+
+Here are some existing evaluations that serve as good examples of what is required in a new submission:
 
-- [HumanEval](https://github.com/UKGovernmentBEIS/inspect_evals/tree/main/src/inspect_evals/humaneval), a Python coding task.
+- [GPQA](src/inspect_evals/gpqa), a simple multiple-choice evaluation
 
-- [InterCode](https://github.com/UKGovernmentBEIS/inspect_evals/tree/main/src/inspect_evals/gdm_capabilities/intercode_ctf), a capture the flag (CTF) cybersecurity task.
+- [GSM8K](src/inspect_evals/gsm8k), a mathematics task with fewshot prompting,
 
+- [HumanEval](src/inspect_evals/humaneval), a Python coding task.
 
+- [InterCode](src/inspect_evals/gdm_capabilities/intercode_ctf), a capture the flag (CTF) cybersecurity task.
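
The new step 2 in `CONTRIBUTING.md` asks contributors to add their evaluation's extra to the `test` dependencies. A minimal sketch of how that rule could be checked is below. It is an illustration, not code from this commit: the function name `extras_missing_from_test` is hypothetical, the rule "every other extra must be referenced from `test`" is one reading of the instruction, and the input is assumed to be the already-parsed `[project.optional-dependencies]` table (for example from `tomllib.load` on Python 3.11+).

```python
def extras_missing_from_test(
    optional_deps: dict[str, list[str]], package: str = "inspect_evals"
) -> list[str]:
    """Return the extras that the `test` extra does not pull in.

    Hypothetical helper: `optional_deps` is the parsed
    `[project.optional-dependencies]` table from pyproject.toml.
    """
    # Entries in `test` are expected to look like "inspect_evals[your_evaluation]".
    test_entries = set(optional_deps.get("test", []))
    missing = []
    for extra in optional_deps:
        if extra == "test":
            continue  # `test` does not need to reference itself
        if f"{package}[{extra}]" not in test_entries:
            missing.append(extra)
    return sorted(missing)


if __name__ == "__main__":
    # Sample table modeled on the snippet added to CONTRIBUTING.md;
    # the contents of the `dev` extra are made up for the example.
    sample = {
        "dev": ["pytest"],
        "your_evaluation": ["example-python-package"],
        "test": ["inspect_evals[dev]", "inspect_evals[your_evaluation]"],
    }
    print(extras_missing_from_test(sample))
```

Matching on the exact string `package[extra]` keeps the sketch simple; it would not recognize combined forms such as `inspect_evals[dev,your_evaluation]`.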
