Welcome to our project dedicated to achieving high scores on SWE-bench Verified. Our goal is to develop and implement strategies that will maximize our performance on this benchmark, which is designed to evaluate AI models' ability to solve real-world software issues.
SWE-bench Verified is a human-validated subset of the original SWE-bench dataset, released by OpenAI. Key points:
- It consists of 500 samples verified by human annotators to be non-problematic.
- It addresses issues in the original dataset, such as overly specific unit tests and underspecified problem statements.
- The benchmark aims to provide a more accurate evaluation of AI models' software engineering capabilities.
- On SWE-bench Verified, GPT-4o resolves 33.2% of samples with the best-performing open-source scaffold (Agentless), roughly double its previous score of 16% on the original SWE-bench.
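The percentage quoted above is a resolve rate: resolved samples divided by total samples. The following is a minimal sketch of that arithmetic, assuming the 500-sample size stated above. The function name `resolve_rate` is illustrative and is not part of any SWE-bench tooling.

```python
def resolve_rate(resolved: int, total: int = 500) -> float:
    """Return the percentage of benchmark samples that were resolved."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= resolved <= total:
        raise ValueError("resolved must be between 0 and total")
    return 100.0 * resolved / total


# 33.2% of 500 samples corresponds to 166 resolved samples.
print(f"{resolve_rate(166):.1f}%")
```

Because the benchmark has 500 samples, each additional resolved sample moves the score by 0.2 percentage points.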
- Clone this repository:
      git clone [Your Repository URL]
      cd [Your Repository Name]
- Install dependencies:
      pip install -r requirements.txt
- Set up Docker (required for evaluation):
  - Follow the Docker setup guide
  - Ensure you have at least 120GB of free disk space
- Run a sample evaluation:
      python run_evaluation.py --model [Your Model Name] --output_dir ./results
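Before starting a long evaluation, it may help to confirm the Docker and disk-space prerequisites from the steps above. The following is a rough preflight sketch, not part of this repository: the names `free_disk_gb` and `preflight` are hypothetical, and the check only confirms that a `docker` executable is on the PATH, not that the daemon is running.

```python
import shutil
from typing import List

REQUIRED_FREE_GB = 120  # figure taken from the setup steps above


def free_disk_gb(path: str = ".") -> float:
    """Return free disk space at `path` in gigabytes (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def preflight(path: str = ".", required_gb: float = REQUIRED_FREE_GB) -> List[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    # Only checks that the executable exists, not that the daemon is up.
    if shutil.which("docker") is None:
        problems.append("docker executable not found on PATH")
    free = free_disk_gb(path)
    if free < required_gb:
        problems.append(
            f"only {free:.0f} GB free at {path!r}, need {required_gb} GB"
        )
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("WARNING:", problem)
```

Pass the directory where Docker stores its images as `path` if it lives on a different volume than the working directory.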
- README.md: This file
- setup.md: Detailed setup instructions
- dataset_info.md: Information about the SWE-bench Verified dataset
- evaluation_guide.md: Guide to running evaluations
- model_development.md: Strategies for developing high-performing models
- results/: Directory for storing evaluation results
- src/: Source code for our models and evaluation scripts
- Read through the detailed documentation in this repository
- Set up your development environment
- Familiarize yourself with the SWE-bench Verified dataset
- Start developing and testing model improvements
- Run evaluations and analyze results
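As one way to approach the analysis step, the sketch below breaks results down by repository. It assumes outcomes are available as a mapping from instance ID to a resolved flag, which is a hypothetical format and not necessarily what the evaluation script in this repository writes. It relies on SWE-bench instance IDs following the `owner__repo-number` pattern. The helper names `repo_of` and `per_repo_summary` are illustrative.

```python
from collections import defaultdict


def repo_of(instance_id: str) -> str:
    """Derive 'owner/repo' from an ID such as 'django__django-11099'."""
    prefix, _, _ = instance_id.rpartition("-")
    return prefix.replace("__", "/")


def per_repo_summary(results: dict) -> dict:
    """Map each repository to a (resolved, total) tuple of counts."""
    counts = defaultdict(lambda: [0, 0])
    for instance_id, resolved in results.items():
        entry = counts[repo_of(instance_id)]
        entry[0] += int(bool(resolved))
        entry[1] += 1
    return {repo: tuple(pair) for repo, pair in counts.items()}


# Made-up outcomes for illustration only; these are not real results.
example = {
    "django__django-11099": True,
    "django__django-11133": False,
    "sympy__sympy-20590": True,
}
for repo, (resolved, total) in sorted(per_repo_summary(example).items()):
    print(f"{repo}: {resolved}/{total} resolved")
```

A per-repository view like this can show whether a change helps broadly or only on one project.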
We welcome contributions! Please see our CONTRIBUTING.md file for guidelines on how to contribute to this project.
This project is licensed under the MIT License. See the LICENSE file for details.
For more detailed information on specific aspects of the project, please refer to the individual markdown files in this repository.