evaluation-framework

Here are 127 public repositories matching this topic...

promptfoo / promptfoo

Test your prompts, agents, and RAGs. Use LLM evals to improve your app's quality and catch problems. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

testing ci evaluation ci-cd cicd prompts evaluation-framework rag llm prompt-engineering llmops prompt-testing llm-eval llm-evaluation llm-evaluation-framework

Updated Jul 4, 2024
TypeScript

huggingface / lighteval

LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library datatrove and LLM training library nanotron.

evaluation evaluation-metrics evaluation-framework huggingface

Updated Jul 4, 2024
Python

relari-ai / continuous-eval

Open-Source Evaluation for LLM Application Pipelines

information-retrieval evaluation-metrics evaluation-framework rag llmops retrieval-augmented-generation llm-evaluation

Updated Jul 4, 2024
Python

confident-ai / deepeval

The LLM Evaluation Framework

evaluation-metrics evaluation-framework llm-evaluation llm-evaluation-framework llm-evaluation-metrics

Updated Jul 4, 2024
Python

symflower / eval-dev-quality

DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.

evaluation software-development software-quality evaluation-framework llms

Updated Jul 4, 2024
Go

EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.

transformer language-model evaluation-framework

Updated Jul 3, 2024
Python

kolenaIO / kolena

Python client for Kolena's machine learning testing platform

testing machine-learning evaluation evaluation-metrics evaluation-framework mlops evaluate-models llmops

Updated Jul 4, 2024
Python

kaiko-ai / eva

Evaluation framework for oncology foundation models (FMs)

machine-learning evaluation-framework oncology foundation-models

Updated Jul 4, 2024
Python

Psycoy / MixEval

The official evaluation suite and dynamic data release for MixEval.

benchmark evaluation benchmarking-suite evaluation-framework benchmarking-framework foundation-models large-language-models large-language-model llm-inference llm-evaluation large-multimodal-models llm-evaluation-framework benchmark-mixture mixeval

Updated Jul 4, 2024
Python

encord-team / text-to-image-eval

Evaluate custom and HuggingFace text-to-image/zero-shot-image-classification models like CLIP, SigLIP, DFN5B, and EVA-CLIP. Metrics include Zero-shot accuracy, Linear Probe, Image retrieval, and KNN accuracy.

knn-search evaluation-metrics evaluation-framework linear-probing embedding-evaluation zero-shot-retrieval zero-shot-classification model-evaluation-metrics embeddings-extraction zero-shot-image-classification text-to-image-evaluation

Updated Jul 3, 2024
Jupyter Notebook

athina-ai / athina-evals

Python SDK for running evaluations on LLM generated responses

evaluation evaluation-metrics evaluation-framework llmops llm-eval llm-ops llm-evaluation llm-evaluation-toolkit

Updated Jul 3, 2024
Python

Giskard-AI / giskard

🐢 Open-Source Evaluation & Testing for LLMs and ML models

Updated Jul 3, 2024
Python

Cybonto / OllaBench

Evaluating LLMs' Cognitive Behavioral Reasoning for Cybersecurity

artificial-intelligence cybersecurity cognitive-science evaluation-framework interdependent-networks large-language-models

Updated Jul 3, 2024
Jupyter Notebook

aiverify-foundation / moonshot

Moonshot - A simple and modular tool to evaluate and red-team any LLM application.

benchmarking evaluation-framework red-teaming trustworthy-ai llm

Updated Jul 3, 2024
Python

jinzhuoran / RWKU

RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models

benchmark natural-language-processing right-to-be-forgotten evaluation-framework privacy-protection adversarial-attacks forgetting membership-inference-attack unlearning large-language-models

Updated Jul 2, 2024
Python

lapix-ufsc / lapixdl

Python package with Deep Learning utilities for Computer Vision

computer-vision deep-learning image-processing evaluation-framework

Updated Jul 2, 2024
Python

TonicAI / tonic_validate

Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.

evaluation-metrics evaluation-framework rag large-language-models llm llms llmops retrieval-augmented-generation

Updated Jul 1, 2024
Python

OPTML-Group / Unlearn-WorstCase

"Challenging Forgets: Unveiling the Worst-Case Forget Sets in Machine Unlearning" by Chongyu Fan*, Jiancheng Liu*, Alfred Hero, Sijia Liu

evaluation data-privacy evaluation-framework machine-unlearning forgetting data-deletion unlearning data-removal

Updated Jul 1, 2024
Python

OPTML-Group / Diffusion-MU-Attack

The official implementation of the ECCV'24 paper "To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now". This work introduces a fast and effective attack method to evaluate the harmful-content generation ability of safety-driven unlearned diffusion models.

evaluation-framework robustness adversarial-attacks unlearning stable-diffusion attack-unlearned-diffusion-model

Updated Jul 1, 2024
Python

chziakas / redeval

Red-teaming LLM applications.

evaluation-framework redteaming llm llmops

Updated Jun 24, 2024
Python
