This project allows you to create and run LLM judges based on annotated datasets using Weights & Biases (wandb) and Weave for tracking and tracing.
- Create a `.env` file in the project root with the following variables:

      WANDB_EMAIL=your_wandb_email
      WANDB_API_KEY=your_wandb_api_key
      OPENAI_API_KEY=your_openai_api_key

- Install the required dependencies.
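If you want to load these variables without an extra dependency, a minimal sketch of a `.env` parser in plain Python (the project may instead rely on a library such as python-dotenv; `load_env` here is a hypothetical helper, not part of this repo):

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> dict:
    """Parse simple KEY=value lines from a .env file into os.environ."""
    loaded = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        loaded[key.strip()] = value.strip()
        # don't clobber variables already set in the shell
        os.environ.setdefault(key.strip(), value.strip())
    return loaded
```

After calling `load_env()`, the keys are available via `os.environ["WANDB_API_KEY"]` and friends.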
To start the annotation app, run:

    python main.py
This will launch a web interface for annotating your dataset.
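The exact record format the app writes is not shown here; as a rough illustration, an annotation could be stored as one JSON object per line (`save_annotation` and its field names are hypothetical, not the app's actual schema):

```python
import json
from pathlib import Path

def save_annotation(path: str, example_input: str, model_output: str,
                    label: str, notes: str = "") -> dict:
    """Append one annotation as a JSON line (hypothetical record shape)."""
    record = {
        "input": example_input,
        "output": model_output,
        "label": label,   # e.g. "correct" / "incorrect"
        "notes": notes,
    }
    with Path(path).open("a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```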
To programmatically create an LLM judge from your wandb dataset annotations:

- Open `forge_evaluation_judge.ipynb` in a Jupyter environment.
- Run all cells in the notebook.

This will generate a judge like the one in `forged_judge`.
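Conceptually, forging a judge amounts to turning annotated examples into judging instructions. A sketch of one way this could look, assuming a simple few-shot prompt; the notebook's actual prompt format and `build_judge_prompt` helper are assumptions, not the repo's API:

```python
def build_judge_prompt(annotations: list[dict], task: str) -> str:
    """Assemble a few-shot judging prompt from annotated examples."""
    lines = [
        f"You are a judge for the task: {task}.",
        "Given an input and a model output, answer 'correct' or 'incorrect'.",
        "Examples:",
    ]
    for a in annotations:
        # each annotation supplies an input, the model's output, and the
        # human label, which becomes the demonstrated verdict
        lines.append(f"Input: {a['input']}")
        lines.append(f"Output: {a['output']}")
        lines.append(f"Verdict: {a['label']}")
    return "\n".join(lines)
```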
To load and run the generated judge:

- Open `run_forged_judge.ipynb` in a Jupyter environment.
- Run all cells in the notebook.

This will evaluate your dataset using the forged judge, with results fully tracked and traced using Weave.
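The evaluation loop can be pictured roughly as below. Decorating the per-row call with `weave.op` (from the real `weave` package) is what records each call's inputs and outputs as a trace; the sketch falls back to a no-op decorator so it runs without Weave installed. `call_judge` stands in for the forged judge's LLM call and is a placeholder, not the notebook's actual API:

```python
try:
    import weave
    op = weave.op  # traces each decorated call when weave.init() has been run
except ImportError:
    def op(fn):  # no-op fallback so the sketch works without Weave
        return fn

@op
def judge_row(row: dict, call_judge) -> dict:
    """Score one dataset row with the judge (placeholder callable)."""
    verdict = call_judge(row["input"], row["output"])
    return {**row, "verdict": verdict}

def evaluate_dataset(rows: list[dict], call_judge) -> dict:
    """Run the judge over every row and summarize accuracy."""
    results = [judge_row(r, call_judge) for r in rows]
    correct = sum(r["verdict"] == "correct" for r in results)
    return {"results": results, "accuracy": correct / len(results)}
```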
- `main.py`: Annotation app
- `forge_evaluation_judge.ipynb`: Judge creation notebook
- `run_forged_judge.ipynb`: Judge execution notebook
All components are integrated with Weave for comprehensive tracking and tracing of your machine learning workflow.
Happy evaluating! 🎉