To develop a new evaluation:

3. Create an `__init__.py` file in the directory and use it to export your task and other useful functions.
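
   For example, a minimal `__init__.py` might simply re-export the task. This is a sketch only, assuming a hypothetical evaluation directory `my_evaluation` whose task `my_task` is defined in `my_evaluation.py`:

   ```python
   # src/inspect_evals/my_evaluation/__init__.py  (placeholder names)
   # Re-export the task so it is importable as inspect_evals.my_evaluation.my_task
   from inspect_evals.my_evaluation.my_evaluation import my_task

   __all__ = ["my_task"]
   ```
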
4. Register your task(s) for execution by the `inspect eval` CLI command by importing them in [src/inspect_evals/_registry.py](src/inspect_evals/_registry.py).
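
   Continuing the placeholder names above, registration amounts to adding an import for your task. This is a sketch of the kind of line you would add, not the registry's actual contents:

   ```python
   # src/inspect_evals/_registry.py  (hypothetical added line)
   from inspect_evals.my_evaluation import my_task
   ```
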
5. Confirm that your evaluation is properly registered by running it:

   ```bash
   inspect eval inspect_evals/<my-task>
   ```

## Submission

To prepare an evaluation for submission as a pull request:

1. Create a `README.md` file in your sub-directory following the template used by other evaluations.

2. Add custom dependencies in the [pyproject.toml](pyproject.toml) file. Don't forget to add your evaluation to the `[test]` dependencies.

   ```toml
   [project.optional-dependencies]
   worldsense = ["pandas"]
   your_evaluation = ["example-python-package"]

   test = [
     "inspect_evals[dev]",
     "inspect_evals[your_evaluation]"
   ]
   ```

   Pin the package version if your evaluation might be sensitive to version updates.

3. Run linting, formatting, and type checking and address any issues:

   ```bash
   make check
   ```

   Note that we automatically run the `make check` commands on all incoming PRs, so be sure your contribution passes those before opening a PR.

4. If an evaluation is based on an existing paper or ported from another evaluation suite, you should validate that it roughly matches the results reported there before submitting your PR. This is to ensure that your new implementation faithfully reproduces the original evaluation.

   Comparison results between existing implementations and your implementation should be included in the pull request description.

5. Add your evaluation to [tools/listing.yaml](tools/listing.yaml) to include it in the documentation (which is automatically generated and deployed by GitHub Actions after merge).

   ```yaml
   - title: "Example-Bench: Your Amazing AI Benchmark"
     description: Describe what the benchmark is about here.
     path: src/inspect_evals/your_evaluation
     arxiv: https://arxiv.org/abs/1234.12345
     group: Coding
     contributors: ["your-github-handle"]
     tasks: ["task-name"]
     dependency: "your_evaluation" # optional field for custom dependency from pyproject.toml
     tags: ["Agent"] # optional tag for agentic evaluations
   ```

   Don't forget to add the custom dependency from [pyproject.toml](pyproject.toml) if applicable.

6. Run [tools/listing.py](tools/listing.py) to format your evaluation's README file according to our standards.

You can expect to receive a PR review within a couple of days.

## Examples

Here are some existing evaluations that serve as good examples of what is required in a new submission:

- [GPQA](src/inspect_evals/gpqa), a simple multiple-choice evaluation.
- [GSM8K](src/inspect_evals/gsm8k), a mathematics task with few-shot prompting.
- [HumanEval](src/inspect_evals/humaneval), a Python coding task.
- [InterCode](src/inspect_evals/gdm_capabilities/intercode_ctf), a capture the flag (CTF) cybersecurity task.