uiuc-kang-lab
diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
Lines changed: 12 additions & 7 deletions
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
Lines changed: 1 addition & 1 deletion
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
Lines changed: 45 additions & 11 deletions
@@ -17,12 +17,17 @@ jobs:
         python-version: ["3.10", "3.11"]
     steps:
       - uses: actions/checkout@v4
+      - name: Display Ruff version
+        uses: astral-sh/ruff-action@v3
+        # Installs ruff for use in later steps
+        with:
+          version: "0.9.1"  # Match version specified in .pre-commit-config.yaml and pyproject.toml
+          args: --version
       - name: Lint with Ruff
-        uses: chartboost/ruff-action@v1
+        run: ruff check
       - name: Format with Ruff
-        uses: chartboost/ruff-action@v1
-        with:
-          args: "format --check"
+        run: ruff format --check
+
 
   mypy:
     runs-on: ubuntu-latest
@@ -45,10 +50,10 @@ jobs:
       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip
-          python -m pip install .[dev]
+          python -m pip install .[test]
       - name: Run mypy
         run: |
-          mypy .
+          mypy --version && mypy .
 
   test:
     runs-on: ubuntu-latest
@@ -72,7 +77,7 @@ jobs:
       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip
-          python -m pip install .[dev]
+          python -m pip install .[test]
       - name: Test with pytest
         run: |
           pytest -rA --doctest-modules --color=yes --cov=inspect_evals
 
@@ -5,7 +5,7 @@ default_language_version:
   python: python3.11
 repos:
 - repo: https://github.com/astral-sh/ruff-pre-commit
-  rev: v0.7.2
+  rev: v0.9.1  # Match version specified in pyproject.toml and .github/workflows/build.yml
   hooks:
     # Run the linter.
     - id: ruff
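
The comments in both hunks above ask that the Ruff version stay in sync across files. A minimal sketch of how that could be checked automatically is shown below. It assumes the two layouts visible in this diff (`version: "..."` under `astral-sh/ruff-action` and `rev: v...` under `ruff-pre-commit`); the function names are hypothetical, and `pyproject.toml` is left out because its pin does not appear in this diff.

```python
import re

# Patterns mirror the two snippets shown in this diff; other layouts are not handled.
WORKFLOW_PIN = re.compile(
    r'ruff-action@\S+.*?version:\s*"?([0-9]+(?:\.[0-9]+)*)', re.DOTALL
)
PRECOMMIT_PIN = re.compile(r'ruff-pre-commit\s*\n\s*rev:\s*v?([0-9]+(?:\.[0-9]+)*)')


def pinned_ruff_versions(workflow_text, precommit_text):
    """Return the Ruff version pinned in each file's text (None if not found)."""
    found = {}
    for name, pattern, text in (
        ("build.yml", WORKFLOW_PIN, workflow_text),
        (".pre-commit-config.yaml", PRECOMMIT_PIN, precommit_text),
    ):
        match = pattern.search(text)
        found[name] = match.group(1) if match else None
    return found


def pins_agree(versions):
    """True when every file has a pin and all pins are identical."""
    values = list(versions.values())
    return None not in values and len(set(values)) == 1
```

Passing the contents of `.github/workflows/build.yml` and `.pre-commit-config.yaml` to `pinned_ruff_versions` and then calling `pins_agree` on the result would flag a mismatch such as the old `v0.7.2` hook next to a `0.9.1` workflow pin.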
 
@@ -18,41 +18,75 @@ To develop a new evaluation:
 
 3. Create an `__init__.py` file in the directory and use it to export your task and other useful functions.
 
-4. Register your task(s) for execution by the `inspect eval` CLI command by importing into [src/inspect_evals/_registry.py](https://github.com/UKGovernmentBEIS/inspect_evals/blob/main/src/inspect_evals/_registry.py).
+4. Register your task(s) for execution by the `inspect eval` CLI command by importing into [src/inspect_evals/_registry.py](src/inspect_evals/_registry.py).
+
 5. Confirm that your evaluation is properly registered by running it:
 
    ```bash
-   inspect eval inspect_evals/<my-evaluation> 
+   inspect eval inspect_evals/<my-task>
    ```
 
 ## Submission
 
 To prepare an evaluation for submission as a pull request:
 
-1.  Create a `README.md` file in your sub-directory following the template used by other evaluations.
+1. Create a `README.md` file in your sub-directory following the template used by other evaluations.
 
-2. Run linting, formatting, and type checking and address any issues:
+2. Add any custom dependencies to [pyproject.toml](pyproject.toml) as an optional-dependency group, and add that group to the `test` dependencies so that CI installs it.
+
+   ```toml
+   [project.optional-dependencies]
+   worldsense = ["pandas"]
+   your_evaluation = ["example-python-package"]
+
+   test = [
+      "inspect_evals[dev]",
+      "inspect_evals[your_evaluation]"
+   ]
+   ```
+
+   Pin the package version if your evaluation might be sensitive to version updates.
+
+3. Run linting, formatting, and type checking and address any issues:
 
    ```bash
    make check
    ```
 
    Note that we automatically run the `make check` commands on all incoming PRs, so be sure your contribution passes those before opening a PR.
 
-3. If an evaluation is based on an existing paper or ported from another evaluation suite, you should validate that it roughly matches the results reported there before submitting your PR. This is to ensure that your new implementation faithfully reproduces the original evaluation. 
+4. If an evaluation is based on an existing paper or ported from another evaluation suite, you should validate that it roughly matches the results reported there before submitting your PR. This is to ensure that your new implementation faithfully reproduces the original evaluation. 
 
    Comparison results between existing implementations and your implementation should be included in the pull request description.
 
-## Examples
+5. Add your evaluation to [tools/listing.yaml](tools/listing.yaml) to include it in the documentation (which is automatically generated and deployed by GitHub Actions after merge).
+
+   ```yaml
+   - title: "Example-Bench: Your Amazing AI Benchmark"
+     description: Describe what the benchmark is about here.
+     path: src/inspect_evals/your_evaluation
+     arxiv: https://arxiv.org/abs/1234.12345
+     group: Coding
+     contributors: ["your-github-handle"]
+     tasks: ["task-name"]
+     dependency: "your_evaluation"  # optional field for custom dependency from pyproject.toml
+     tags: ["Agent"]  # optional tag for agentic evaluations
+   ```
 
-Here are some existing evaluations that serve as good examples of what is required in a new submission:
+   If your evaluation has a custom dependency group in [pyproject.toml](pyproject.toml), set the `dependency` field to that group's name.
+
+6. Run [tools/listing.py](tools/listing.py) to format your evaluation's README file according to our standards.
 
-- [GPQA](https://github.com/UKGovernmentBEIS/inspect_evals/tree/main/src/inspect_evals/gpqa), a simple multiple-choice evaluation
+You can expect to receive a PR review within a couple of days.
 
-- [GSM8K](https://github.com/UKGovernmentBEIS/inspect_evals/blob/main/src/inspect_evals/gsm8k/gsm8k.py), a mathematics task with fewshot prompting,
+## Examples
+
+Here are some existing evaluations that serve as good examples of what is required in a new submission:
 
-- [HumanEval](https://github.com/UKGovernmentBEIS/inspect_evals/tree/main/src/inspect_evals/humaneval), a Python coding task.
+- [GPQA](src/inspect_evals/gpqa), a simple multiple-choice evaluation.
 
-- [InterCode](https://github.com/UKGovernmentBEIS/inspect_evals/tree/main/src/inspect_evals/gdm_capabilities/intercode_ctf), a capture the flag (CTF) cybersecurity task.
+- [GSM8K](src/inspect_evals/gsm8k), a mathematics task with few-shot prompting.
 
+- [HumanEval](src/inspect_evals/humaneval), a Python coding task.
 
+- [InterCode](src/inspect_evals/gdm_capabilities/intercode_ctf), a capture the flag (CTF) cybersecurity task.
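
The new submission step asks contributors to add their evaluation's dependency group to the `test` extra, which is easy to forget. A minimal sketch of a check for that is shown below. It assumes the `inspect_evals[<group>]` self-reference convention from the `pyproject.toml` snippet in this diff; the function name and the contents of the `dev` group are hypothetical.

```python
def extras_missing_from_test(optional_deps, package="inspect_evals", skip=("dev", "test")):
    """List extras that the `test` extra does not pull in via `package[extra]`."""
    included = set(optional_deps.get("test", []))
    return sorted(
        name
        for name in optional_deps
        if name not in skip and f"{package}[{name}]" not in included
    )


# Hypothetical table in which `your_evaluation` was not added to `test`.
example = {
    "your_evaluation": ["example-python-package"],
    "dev": ["pytest"],
    "test": ["inspect_evals[dev]"],
}
print(extras_missing_from_test(example))  # → ['your_evaluation']
```

On Python 3.11+, the input mapping could be obtained with `tomllib.load(...)["project"]["optional-dependencies"]`. An empty result means every non-skipped group is installed by `pip install .[test]`.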