feat(weave): Add xtra Scorers to Weave #3006

Open · wants to merge 279 commits into base: master
Changes from 128 commits
Commits (279 total)
c83cf16
set temp to 0
tcapelle Nov 25, 2024
8d051fd
roll back
tcapelle Nov 25, 2024
f24e128
another typo
tcapelle Nov 25, 2024
520bc1e
multi cat
tcapelle Nov 25, 2024
bf47b87
unset temp
tcapelle Nov 25, 2024
b3c08e3
make generate_config a param
tcapelle Nov 25, 2024
a19d62c
fix conversation and default
tcapelle Nov 25, 2024
77b7603
typing
tcapelle Nov 25, 2024
128eb8a
remove print
tcapelle Nov 25, 2024
39a544b
argument name change + interpretation string
ayulockin Nov 26, 2024
5f5833c
make imports optional
ayulockin Nov 26, 2024
24ac09e
lint
ayulockin Nov 26, 2024
2038d52
Merge branch 'master' into xtra-scorers
ayulockin Nov 26, 2024
450d72c
default to latest moderation
tcapelle Nov 26, 2024
5325460
renaming and fixing
tcapelle Nov 26, 2024
9e03cc5
fix private attr
tcapelle Nov 26, 2024
7684e53
missing priv
tcapelle Nov 26, 2024
55dc69e
update robustness scorer with more control knobs
ayulockin Nov 27, 2024
55091cb
collapse interpretation + return abs and sign
ayulockin Nov 27, 2024
45ed447
Merge branch 'master' into xtra-scorers
ayulockin Nov 28, 2024
538d149
add bleu scorer
ayulockin Nov 28, 2024
f68f007
docstrings and unifications
tcapelle Nov 28, 2024
9ed0d7d
lint + formatting
ayulockin Nov 28, 2024
e304e9c
rouge scorer init
ayulockin Nov 29, 2024
24bd665
add tests + formatting
ayulockin Nov 29, 2024
075a0a3
formatting tests
ayulockin Nov 29, 2024
eaccca1
feat: add relevance scorer impl
parambharat Nov 29, 2024
f4fbfe0
fix: move system prompt to upper level
parambharat Nov 29, 2024
62210a9
add relevance scorer to scorers init
morganmcg1 Nov 30, 2024
ed1760b
remove async from relevance score func
morganmcg1 Nov 30, 2024
dc341ce
change relevance scorer output to is_relevant
morganmcg1 Nov 30, 2024
54ae8d7
remove async from coherence scorer
morganmcg1 Dec 1, 2024
8e4b73a
change coherent to is_coherent in scorer
morganmcg1 Dec 1, 2024
6d6b87c
return_all_scores
tcapelle Dec 1, 2024
e41a8e9
fix: remove sampling params from relevance scorer to make it determin…
parambharat Dec 2, 2024
4cfe205
feat: try Contrastive search sampling
parambharat Dec 2, 2024
a1370e4
perplexity scorer finalize
ayulockin Dec 2, 2024
8647bdb
add tests for perplexity
ayulockin Dec 2, 2024
5767207
formatting
ayulockin Dec 2, 2024
00626f2
minor format
ayulockin Dec 2, 2024
43d87c7
name_or_path
tcapelle Dec 2, 2024
6865378
name_or_path
tcapelle Dec 2, 2024
d6b44a6
accuracy scorer -- binary
ayulockin Dec 2, 2024
bf22114
add base_url
tcapelle Dec 2, 2024
a961c9c
update __init__.py
ayulockin Dec 2, 2024
8aaa083
add hallucination scorer local
morganmcg1 Dec 2, 2024
8472976
Hallu scorer fixes
morganmcg1 Dec 2, 2024
f1c7a4e
fix hallu scoroer
morganmcg1 Dec 2, 2024
6dab784
fix hallu scorer
morganmcg1 Dec 2, 2024
7d7f153
add artifactss download to hallu
morganmcg1 Dec 2, 2024
b0eedbb
fix hallu download
morganmcg1 Dec 2, 2024
c031a66
fix hallu download
morganmcg1 Dec 2, 2024
7c3b5d6
fix hallu download
morganmcg1 Dec 2, 2024
acede5a
fix hallu download
morganmcg1 Dec 2, 2024
67c941b
fix hallu download
morganmcg1 Dec 2, 2024
c2cd2ac
add api bias scorer
tcapelle Dec 2, 2024
d6a6f7a
rename bias checkpoint
tcapelle Dec 2, 2024
90d2bea
fix scorer
morganmcg1 Dec 2, 2024
2781494
add model paths for scorers
morganmcg1 Dec 3, 2024
a978da5
remove unused _download
tcapelle Dec 3, 2024
bb9416b
update bias
tcapelle Dec 3, 2024
920f2dd
add hhem hallucination
morganmcg1 Dec 3, 2024
ad15579
fix hhem
morganmcg1 Dec 3, 2024
c18edb2
fix tokenizer
morganmcg1 Dec 3, 2024
3bad7c7
faithful
tcapelle Dec 3, 2024
e27cd75
lint and small fixes
tcapelle Dec 3, 2024
233b8b8
release
tcapelle Dec 3, 2024
259a40d
correct Hallucination base
tcapelle Dec 3, 2024
9c296f9
typo
tcapelle Dec 3, 2024
8de823c
cohen's d threshold
ayulockin Dec 3, 2024
cbedb98
fix paths
morganmcg1 Dec 3, 2024
4d9defe
fix sig
morganmcg1 Dec 3, 2024
7b84e33
fix sig
morganmcg1 Dec 3, 2024
458b8f6
Merge branch 'master' into xtra-scorers
morganmcg1 Dec 10, 2024
d06804b
Align output to "flagged"
morganmcg1 Dec 10, 2024
9877d56
hallu
morganmcg1 Dec 10, 2024
517a901
remove predictions print from rollingwindowscoroer
morganmcg1 Dec 10, 2024
2d908a7
remove print
morganmcg1 Dec 10, 2024
30f5f7f
relevance scorer
tcapelle Dec 10, 2024
5194992
return_all_scores
tcapelle Dec 10, 2024
3b954ff
add output
tcapelle Dec 10, 2024
7af240c
remove torch
tcapelle Dec 10, 2024
5bc7920
tupo
tcapelle Dec 10, 2024
ad044fa
add context relevance scorer artifact
morganmcg1 Dec 10, 2024
e773622
Add artifcacts download for relevance scorer
morganmcg1 Dec 10, 2024
8c80e8d
relevance scorer
morganmcg1 Dec 10, 2024
37c6b44
add ContextRelevanceScorer
morganmcg1 Dec 10, 2024
3b0a133
hide query
tcapelle Dec 10, 2024
19707f6
rename relevance to context relevance
morganmcg1 Dec 10, 2024
5457b3c
update relevance scorer
morganmcg1 Dec 10, 2024
95f266a
update output to res
morganmcg1 Dec 10, 2024
d7e3210
rename documents to context in relevance scorer
morganmcg1 Dec 10, 2024
4e65547
mask query tokens on the combined mask
tcapelle Dec 10, 2024
a17bf51
flat
tcapelle Dec 10, 2024
5b13022
add verbose param
tcapelle Dec 10, 2024
85feae9
add verbose param
tcapelle Dec 10, 2024
39dd40e
more readable
tcapelle Dec 10, 2024
b37a385
return_all_scores -> verbose
tcapelle Dec 10, 2024
60bf3e0
Fix truncation for hallu, coherence, cont rele
morganmcg1 Dec 11, 2024
68c1b5e
test(weave): Add large input tests for scorers
devin-ai-integration[bot] Dec 13, 2024
87820b4
test(weave): Add large input tests for scorers
devin-ai-integration[bot] Dec 13, 2024
6d6e068
test(weave): Add flagging tests for scorers
devin-ai-integration[bot] Dec 13, 2024
4683cd2
Update hallucination scorer tests to use HHEM model and improve test …
openhands-agent Dec 17, 2024
7007d41
Simplify hallucination scorer tests to use actual HHEM model
morganmcg1 Dec 17, 2024
583e175
test(weave): Update scorer tests to use actual ML models
devin-ai-integration[bot] Dec 17, 2024
434c4e8
feat(weave): Add test model configurations
devin-ai-integration[bot] Dec 17, 2024
9fdbf54
chore(weave): Update gitignore for test model files
devin-ai-integration[bot] Dec 17, 2024
b51573d
feat(weave): Add model download scripts for test setup
devin-ai-integration[bot] Dec 17, 2024
4098f0c
update hallu scorer tests
morganmcg1 Dec 17, 2024
ce11237
Merge branch 'xtra-scorers' of https://github.com/wandb/weave into xt…
morganmcg1 Dec 17, 2024
e194034
Merge branch 'master' into xtra-scorers
tcapelle Dec 19, 2024
f6aac2e
Merge branch 'master' into xtra-scorers
tcapelle Dec 19, 2024
295c2f0
fix imports, remove unused code
tcapelle Dec 19, 2024
7a4be18
extract HF and rolling window stuff, add tests
tcapelle Dec 19, 2024
8e88178
refactor: rename scorer classes for clarity and consistency
tcapelle Dec 19, 2024
f7ac57a
add tiny model
tcapelle Dec 19, 2024
72cb919
refactor: remove unused RollingWindowScorer from imports and __all__ …
tcapelle Dec 19, 2024
f45bc27
refactor(tests): simplify ToxicityScorer tests and update model path
tcapelle Dec 19, 2024
e7a97c0
fixed flan T5 packing and tests!
tcapelle Dec 19, 2024
2907663
refactor(tests): update LlamaGuard tests and rename scorer class
tcapelle Dec 19, 2024
8fb46eb
remove unused scorer
tcapelle Dec 19, 2024
8fe3f83
typo
tcapelle Dec 19, 2024
d6b7382
refactor(tests): update CoherenceScorer tests and model path
tcapelle Dec 19, 2024
00c3198
Add bias tests
tcapelle Dec 19, 2024
5871d75
lint
tcapelle Dec 19, 2024
fd7077b
lint
tcapelle Dec 19, 2024
8fc9a55
remove unused code
tcapelle Dec 19, 2024
43c0f91
lazy import
tcapelle Dec 19, 2024
b0401de
lazy import
tcapelle Dec 19, 2024
2408d0f
hf scorers and set_device
tcapelle Dec 19, 2024
e68e507
remove unused tests
tcapelle Dec 19, 2024
3694fdb
remove not useful test
tcapelle Dec 20, 2024
0057879
undo Devin
tcapelle Dec 20, 2024
a9d036a
rework model/tokenizer logic
tcapelle Dec 20, 2024
29bf188
mypy
tcapelle Dec 20, 2024
d5bcc4c
Merge branch 'master' into xtra-scorers
tcapelle Dec 20, 2024
c15033c
more MyPy
tcapelle Dec 20, 2024
9a58f1f
Union
tcapelle Dec 20, 2024
2e95abb
fix test
tcapelle Dec 20, 2024
674adb7
no more lint!!
tcapelle Dec 20, 2024
aaff52b
parasite files
tcapelle Dec 20, 2024
d1dfafe
typo
tcapelle Dec 20, 2024
5512ef4
missing TYPE_CHECKING
tcapelle Dec 20, 2024
7b985a4
same...
tcapelle Dec 20, 2024
0c17734
Trigger CI
tcapelle Dec 20, 2024
c294469
Added WANDB_API_KEY to GitHub Actions workflow for enhanced integrati…
tcapelle Dec 20, 2024
a4ae657
unused import
tcapelle Dec 20, 2024
0d4a419
make sync
tcapelle Jan 13, 2025
49b901c
Merge branch 'master' into xtra-scorers
tcapelle Jan 13, 2025
77c6166
remove verbose stuff
tcapelle Jan 13, 2025
725ead0
make test sync
tcapelle Jan 13, 2025
d9d2444
Merge branch 'master' into xtra-scorers
tcapelle Jan 13, 2025
5b0e55d
typo
tcapelle Jan 14, 2025
7712d89
Merge branch 'master' into xtra-scorers
morganmcg1 Jan 27, 2025
8a1badd
Reverse logic of "flagged" in ContextRelevance scorer
morganmcg1 Jan 27, 2025
186edc0
Adjust threshold for ContextRelevance scorer
morganmcg1 Jan 27, 2025
6828cbc
Remove "output" arg from ContextRelevance inputs
morganmcg1 Jan 27, 2025
21294c3
Rename "coherence_score" key to "score" for consistency
morganmcg1 Jan 27, 2025
a1b01f8
add fluency
tcapelle Jan 27, 2025
0023919
eager import
tcapelle Jan 27, 2025
4239cf5
udpate device vaqlue
tcapelle Jan 27, 2025
bec2ad3
import on top
tcapelle Jan 27, 2025
3826609
wrong improt
tcapelle Jan 27, 2025
58aff6d
annotate
tcapelle Jan 27, 2025
d7eafe5
re-work logic
tcapelle Jan 27, 2025
50ed2a0
missing task and clenup
tcapelle Jan 27, 2025
5663157
restore deleted methods
tcapelle Jan 27, 2025
58a1c16
handle cuda not available correctly
tcapelle Jan 27, 2025
f95d9b3
make sync
tcapelle Jan 27, 2025
9ec65b5
invert logic
tcapelle Jan 27, 2025
bc72bfd
make non-fluent default
tcapelle Jan 27, 2025
0058035
add trust scorer
tcapelle Jan 28, 2025
053d6bf
Accept custom artifact path for ContextRelevance
morganmcg1 Feb 5, 2025
f739ba6
Use custom artifact download path for HallucinationScorer
morganmcg1 Feb 5, 2025
197839c
make them run in parallel
tcapelle Feb 5, 2025
6b97d79
rename trust_scorer (missing "r")
tcapelle Feb 5, 2025
bd0adeb
Remove non-HHEM code and flip the threshold logic in HallucinationSco…
morganmcg1 Feb 5, 2025
61d3316
Merge branch 'xtra-scorers' of https://github.com/wandb/weave into xt…
morganmcg1 Feb 5, 2025
f73b7a7
Re-add required load_tokenizer method to HallucinatrionScorer
morganmcg1 Feb 5, 2025
bb824c7
Increase HallucinationScorer recall, threshold from 0.5 to 0.35
morganmcg1 Feb 5, 2025
e7d705d
Add download from custom artifact path to BiasScorer
morganmcg1 Feb 5, 2025
cfe2a0e
Update BiasScorer threshold from 0.5 to 0.65
morganmcg1 Feb 5, 2025
dd13e86
Add custom artifact download ability to ToxicityScorer
morganmcg1 Feb 5, 2025
c97963d
Update HallucinationScorer to accept list of contexts
morganmcg1 Feb 5, 2025
a2a8cf4
Add custom artifact path to CoherenceScorer download
morganmcg1 Feb 5, 2025
c6e981b
Upload embedding model weights for RobustnessScorer
morganmcg1 Feb 5, 2025
9cd5680
Add required load_tokenizer method to RobustnessScorer
morganmcg1 Feb 5, 2025
537f0a8
Modify RobustnessScorer parameters
morganmcg1 Feb 5, 2025
8cc75e9
Update score parameter naming in RobustnessScorer
morganmcg1 Feb 5, 2025
dfbc3b1
fix RobustnessScorer params
morganmcg1 Feb 5, 2025
038ae8e
fix RobustnessScorer
morganmcg1 Feb 5, 2025
ddbcfa6
Add FluencyScorer artifact
morganmcg1 Feb 5, 2025
e343b27
Add custom artifact path option for TrustScorer
morganmcg1 Feb 5, 2025
e5bb561
TrustworthinessScorer args fix
morganmcg1 Feb 5, 2025
a8cbcb8
Update threshold arg for TrustScorer
morganmcg1 Feb 5, 2025
644d312
Output raw scores for Fluency and Toxicity in TrustScore
morganmcg1 Feb 5, 2025
fbbaec7
TrustScorer formating and raise exception on failed scorer
morganmcg1 Feb 5, 2025
915b008
add debugging prints to TrustScorer
morganmcg1 Feb 5, 2025
6ca080c
Update CoherenceScorer to use query not input
morganmcg1 Feb 5, 2025
9c193d7
add TrustScore debug
morganmcg1 Feb 5, 2025
02e09bc
add logging of Fluency score in TrustScorer
morganmcg1 Feb 6, 2025
d91221c
Modify FluencyScorer and CoherenceScorer so that high score is high t…
morganmcg1 Feb 6, 2025
e18e2f0
Use scorer threshold contstants in TrustScore instead of duplicate th…
morganmcg1 Feb 6, 2025
2a3edaa
Clean up TrustScore print
morganmcg1 Feb 6, 2025
1958abc
TrustScore uses direct thresholds, rename robustness weights to embed…
morganmcg1 Feb 6, 2025
c6664a6
add score to FluencyScorer output
morganmcg1 Feb 6, 2025
0d61df3
simlify FluencyScorer score output
morganmcg1 Feb 6, 2025
b549b43
Add critical and adbivsory issues to the TrustScorer output
morganmcg1 Feb 6, 2025
4af281a
In local Scorers: flip 'flagged' to 'pass', clean up base_url
morganmcg1 Feb 6, 2025
6aa8277
modify BiasScorer output
morganmcg1 Feb 6, 2025
eddfbb8
Fix CoherenceScorer score
morganmcg1 Feb 6, 2025
e77c9cf
Rename scorers: local models use Weave*, LLM powered ones LLM*. Delet…
morganmcg1 Feb 6, 2025
d6f6002
Add Weave* to scorer class names, Part 2
morganmcg1 Feb 6, 2025
d27c529
Make naming consistent
tcapelle Feb 6, 2025
16e2701
Merge branch 'master' into xtra-scorers
tcapelle Feb 6, 2025
ebfd32e
fix imports and merge
tcapelle Feb 6, 2025
8cf05ba
remove base_scorer stuff
tcapelle Feb 6, 2025
a7c7bbb
llm_utils -> utils
tcapelle Feb 6, 2025
eaeef28
fix some last imports
tcapelle Feb 6, 2025
54aea81
ignore for the meantime
tcapelle Feb 6, 2025
c44ec0c
lint
tcapelle Feb 6, 2025
3f112a6
delete guardrails
tcapelle Feb 6, 2025
fa6bea0
Update scorers deps, transformers and torch
morganmcg1 Feb 6, 2025
9e0e322
Merge branch 'xtra-scorers' of https://github.com/wandb/weave into xt…
morganmcg1 Feb 6, 2025
342b5af
Try change BiasScorer .score arg to text
morganmcg1 Feb 6, 2025
aa55492
add noop 'output' arg to biasscorer for evals
morganmcg1 Feb 6, 2025
4330385
Add type checking for `score` params to weave models, update docstrings
morganmcg1 Feb 7, 2025
b9a01aa
refactor model download and hf checks
tcapelle Feb 7, 2025
58d450b
raise custom exception
tcapelle Feb 7, 2025
5bae0d8
use find_spec again
tcapelle Feb 7, 2025
358dd5b
Merge remote-tracking branch 'origin/xtra-scorers' into xtra-scorers
tcapelle Feb 7, 2025
d025a19
Re-add misspelled tokenzier
morganmcg1 Feb 7, 2025
d326134
Add misspelled tokenzier comment
morganmcg1 Feb 7, 2025
ddd2aee
Refactor scorer imports and method implementations
morganmcg1 Feb 7, 2025
6368236
remove _validate_input from TrustScorer
morganmcg1 Feb 7, 2025
102c2be
Restore input filtering method in WeaveTrustScorer
morganmcg1 Feb 7, 2025
a6d2bd6
fix Toxicity
tcapelle Feb 7, 2025
266c010
Add ScorerResult type for consistent scorer return values
tcapelle Feb 7, 2025
5e976f3
ruff
tcapelle Feb 7, 2025
24917b1
mypy happy
tcapelle Feb 7, 2025
2b94965
Rename ScorerResult to WeaveScorerResult
tcapelle Feb 7, 2025
68a0bd4
Update scorers to use WeaveScorerResult consistently
morganmcg1 Feb 7, 2025
d94e86f
update loading
tcapelle Feb 7, 2025
fe64df8
mypy happy
tcapelle Feb 7, 2025
bcf334d
Add to_dict to WeaveScorerResult
morganmcg1 Feb 7, 2025
3f77c21
TrustScorer fix
morganmcg1 Feb 7, 2025
4ab47fb
fix PerplexityScorer return type
morganmcg1 Feb 7, 2025
e9caa22
use new hemm model
tcapelle Feb 11, 2025
0d891bb
make loading methods public
tcapelle Feb 11, 2025
2cbc51f
doesn't support description
tcapelle Feb 11, 2025
1 change: 1 addition & 0 deletions .github/workflows/test.yaml
@@ -297,6 +297,7 @@ jobs:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
MISTRAL_API_KEY: ${{ secrets.MISTRAL_API_KEY }}
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
HF_API_TOKEN: ${{ secrets.HF_API_TOKEN }}
run: |
nox -e "tests-${{ matrix.python-version-major }}.${{ matrix.python-version-minor }}(shard='${{ matrix.nox-shard }}')"
trace-tests-matrix-check: # This job does nothing and is only used for the branch protection
1 change: 1 addition & 0 deletions noxfile.py
@@ -80,6 +80,7 @@ def tests(session, shard):
env["ANTHROPIC_API_KEY"] = session.env.get("ANTHROPIC_API_KEY")
env["MISTRAL_API_KEY"] = session.env.get("MISTRAL_API_KEY")
env["OPENAI_API_KEY"] = session.env.get("OPENAI_API_KEY")
env["HF_API_TOKEN"] = session.env.get("HF_API_TOKEN")

default_test_dirs = [f"integrations/{shard}/"]
test_dirs_dict = {
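Both the GitHub Actions workflow and the nox test session above now forward `HF_API_TOKEN`, so the new scorer tests can download gated or W&B-hosted model weights from the Hugging Face Hub. A minimal sketch of how test setup code might consume that variable; the `login` call and helper name are assumptions, not part of this diff:

```python
import os

from huggingface_hub import login  # assumed available; transformers depends on huggingface_hub


def authenticate_hf() -> None:
    """Log in to the Hugging Face Hub if CI or nox forwarded a token."""
    token = os.environ.get("HF_API_TOKEN")
    if token:
        login(token=token)
```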
7 changes: 7 additions & 0 deletions pyproject.toml
@@ -78,6 +78,13 @@ scorers_tests = [
"google-generativeai>=0.8.0",
"mistralai>=1.0.3",
"anthropic>=0.30.0",
"sentence-transformers>=3.3.1",
"scikit-learn>=1.5.2",
"transformers>=4.35.0",
"accelerate>=1.0.0",
"torch>=2.2.0",
"sacrebleu>=2.4.2",
"rouge>=1.0.1",
]
notdiamond = ["notdiamond>=0.3.21", "litellm<=1.49.1"]
openai = ["openai>=1.0.0"]
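The `scorers_tests` extra pulls in heavy optional dependencies (torch, transformers, sentence-transformers, sacrebleu, rouge). Several commits in this branch make scorer imports optional or lazy; a sketch of that guard pattern using `importlib.util.find_spec`, with an illustrative helper name and error message:

```python
import importlib.util


def ensure_scorer_deps(*packages: str) -> None:
    """Raise a clear error if optional scorer dependencies are not installed."""
    missing = [pkg for pkg in packages if importlib.util.find_spec(pkg) is None]
    if missing:
        raise ImportError(
            f"Missing optional scorer dependencies: {missing}. "
            "Install them via the 'scorers_tests' extra."
        )


# A transformers-backed scorer would call this before loading any model:
ensure_scorer_deps("torch", "transformers")
```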
184 changes: 184 additions & 0 deletions tests/scorers/test_bleu_scorer.py
@@ -0,0 +1,184 @@
import math

import pytest # type: ignore

from weave.scorers import BLEUScorer


def truncate(number, decimals=0):
"""Truncates a number to the specified number of decimal places without rounding."""
factor = 10.0**decimals
return math.trunc(number * factor) / factor


def test_bleu_scorer_initialization():
# Test default initialization
scorer = BLEUScorer()
assert scorer.lowercase == False
assert scorer.tokenize is None
assert scorer.smooth_method == "exp"
assert scorer.smooth_value is None
assert scorer.max_ngram_order == 4
assert scorer.effective_order == True
assert scorer.bleu is not None

# Test initialization with custom parameters
scorer = BLEUScorer(
lowercase=True,
tokenize="13a",
smooth_method="add-k",
smooth_value=1.0,
max_ngram_order=2,
effective_order=False,
)
assert scorer.lowercase == True
assert scorer.tokenize == "13a"
assert scorer.smooth_method == "add-k"
assert scorer.smooth_value == 1.0
assert scorer.max_ngram_order == 2
assert scorer.effective_order == False
assert scorer.bleu is not None


def test_bleu_scorer_score_method():
scorer = BLEUScorer()
output = "The cat is on the mat."
ground_truths = ["The cat is on the mat.", "There is a cat on the mat."]

# Test score method with exact match
result = scorer.score(ground_truths=ground_truths, output=output)
assert isinstance(result, dict)
assert truncate(result["sentence_bleu"], 1) == 100.0
assert truncate(result["sentence_bp"], 1) == 1.0
assert result["output_pred"] == output
assert result["output_refs"] == ground_truths

# Test score method with partial match
output = "The cat sat on the mat."
result = scorer.score(ground_truths=ground_truths, output=output)
assert result["sentence_bleu"] < 100.0
assert result["output_pred"] == output

# Test with single reference
output = "The dog is in the house."
ground_truths = "The dog is outside."
result = scorer.score(ground_truths=ground_truths, output=output)
assert isinstance(result["output_refs"], list)
assert result["output_refs"] == [ground_truths]


def test_bleu_scorer_score_method_invalid_input():
scorer = BLEUScorer()
output = "Sample output"

# Test with invalid ground_truths type
with pytest.raises(
AssertionError, match="`ground_truths` must be a list of strings."
):
scorer.score(ground_truths=123, output=output)


def test_bleu_scorer_summarize_method():
scorer = BLEUScorer()
score_rows = [
{
"sentence_bleu": 100.0,
"sentence_bp": 1.0,
"output_pred": "The cat is on the mat.",
"output_refs": ["The cat is on the mat."],
},
{
"sentence_bleu": 50.0,
"sentence_bp": 0.8,
"output_pred": "A dog is in the yard.",
"output_refs": ["The dog is in the yard."],
},
{
"sentence_bleu": 0.0,
"sentence_bp": 0.5,
"output_pred": "Completely different sentence.",
"output_refs": ["No match here."],
},
]

result = scorer.summarize(score_rows)
assert isinstance(result, dict)
assert "corpus_level" in result
assert "sentence_level" in result
assert truncate(result["sentence_level"]["bleu"], 1) == 50.0

# Verify corpus-level BLEU score
corpus_bleu = result["corpus_level"]["bleu"]
assert truncate(corpus_bleu, 1) >= 0.0 and truncate(corpus_bleu, 1) <= 100.0


def test_bleu_scorer_summarize_method_empty_input():
scorer = BLEUScorer()
score_rows = []
result = scorer.summarize(score_rows)
assert result == {}


def test_bleu_scorer_summarize_method_invalid_score_rows():
scorer = BLEUScorer()
score_rows = ["invalid", 123, None]
with pytest.raises(AssertionError):
scorer.summarize(score_rows)


def test_bleu_scorer_corpus_score():
scorer = BLEUScorer()
score_rows = [
{
"sentence_bleu": 100.0,
"sentence_bp": 1.0,
"output_pred": "The cat is on the mat.",
"output_refs": ["The cat is on the mat."],
},
{
"sentence_bleu": 50.0,
"sentence_bp": 0.8,
"output_pred": "A dog is in the yard.",
"output_refs": ["The dog is in the yard.", "A dog is outside."],
},
]

result = scorer.summarize(score_rows)
print(result)
corpus_bleu = result["corpus_level"]["bleu"]
assert truncate(corpus_bleu, 1) == 100.0


def test_bleu_scorer_with_different_tokenizer():
# Test BLEUScorer with a different tokenizer
scorer = BLEUScorer(tokenize="char")
output = "abcd"
ground_truths = ["abcf"]

result = scorer.score(ground_truths=ground_truths, output=output)
assert result["sentence_bleu"] < 100.0


def test_bleu_scorer_effective_order():
# Test BLEUScorer with effective_order set to False
scorer = BLEUScorer(effective_order=False)
output = "The cat"
ground_truths = ["The cat is on the mat."]

result = scorer.score(ground_truths=ground_truths, output=output)
# With effective_order=False, the score might be lower due to missing higher-order n-grams
assert result["sentence_bleu"] < 100.0


def test_bleu_scorer_smooth_method():
# Test BLEUScorer with different smoothing methods
scorer = BLEUScorer(smooth_method="floor", smooth_value=0.1)
output = "The cat sat on the mat."
ground_truths = ["The cat is on the mat."]

result = scorer.score(ground_truths=ground_truths, output=output)
assert result["sentence_bleu"] > 0.0

# Test with invalid smoothing method
with pytest.raises(ValueError):
BLEUScorer(smooth_method="invalid_method")
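For reference, a usage sketch built from the API these tests exercise; the inputs are illustrative, and scores are on sacrebleu's 0-100 scale:

```python
from weave.scorers import BLEUScorer

scorer = BLEUScorer(lowercase=True, max_ngram_order=4)
result = scorer.score(
    ground_truths=["The cat is on the mat.", "There is a cat on the mat."],
    output="The cat sat on the mat.",
)
print(result["sentence_bleu"])  # sentence-level BLEU, 0-100
print(result["sentence_bp"])    # brevity penalty

# Corpus-level aggregation over previously scored rows:
summary = scorer.summarize([result])
print(summary["corpus_level"]["bleu"])
```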
95 changes: 95 additions & 0 deletions tests/scorers/test_coherence_scorer.py
@@ -0,0 +1,95 @@
import pytest

import weave
from weave.scorers.coherence_scorer import CoherenceScorer


@pytest.fixture
def coherence_scorer(monkeypatch):
scorer = CoherenceScorer(
model_name="wandb/coherence_scorer",
device="cpu",
)

def mock_pipeline(*args, **kwargs):
def inner(inputs):
if "incoherent" in inputs["text_pair"] or "incoherent" in inputs["text"]:
return {
"label": "incoherent",
"score": 0.2,
}
return {
"label": "coherent",
"score": 0.95,
}

return inner

monkeypatch.setattr(scorer, "_classifier", mock_pipeline())
return scorer


def test_score_messages_with_coherent_output(coherence_scorer):
prompt = "This is a test prompt."
output = "This is a coherent response."
result = coherence_scorer.score_messages(prompt, output)
assert result["coherent"]
assert result["coherence"] == "coherent"
assert result["coherence_score"] == pytest.approx(0.95)


def test_score_messages_with_incoherent_output(coherence_scorer):
prompt = "This is a test prompt."
output = "This is an incoherent response."
result = coherence_scorer.score_messages(prompt, output)
assert not result["coherent"]
assert result["coherence"] == "incoherent"
assert result["coherence_score"] == pytest.approx(0.2)


@pytest.mark.asyncio
async def test_score_with_chat_history(coherence_scorer):
prompt = "This is a test prompt."
output = "This is a coherent response."
chat_history = [
{"role": "user", "text": "Hello"},
{"role": "assistant", "text": "Hi"},
]
result = await coherence_scorer.score(prompt, output, chat_history=chat_history)
assert result["coherent"]
assert result["coherence"] == "coherent"
assert result["coherence_score"] == pytest.approx(0.95)


@pytest.mark.asyncio
async def test_score_with_context(coherence_scorer):
prompt = "This is a test prompt."
output = "This is a coherent response."
context = "This is additional context."
result = await coherence_scorer.score(prompt, output, context=context)
assert result["coherent"]
assert result["coherence"] == "coherent"
assert result["coherence_score"] == pytest.approx(0.95)


@pytest.mark.asyncio
async def test_coherence_scorer_evaluation(coherence_scorer):
dataset = [
{"input": "This is a coherent text."},
{"input": "This is an incoherent text."},
]

@weave.op
def model(input: str):
return input

evaluation = weave.Evaluation(
dataset=dataset,
scorers=[coherence_scorer],
)
result = await evaluation.evaluate(model)

assert "CoherenceScorer" in result
assert "coherent" in result["CoherenceScorer"]
assert result["CoherenceScorer"]["coherent"]["true_count"] == 1
assert result["CoherenceScorer"]["coherent"]["true_fraction"] == pytest.approx(0.5)
74 changes: 74 additions & 0 deletions tests/scorers/test_llamaguard_scorer.py
@@ -0,0 +1,74 @@
import pytest
from transformers import AutoTokenizer

import weave
from weave.scorers import LlamaGuard

_TINY_MODEL_NAME = "HuggingFaceM4/tiny-random-LlamaForCausalLM"
_LLAMAGUARD_MODEL_NAME = "meta-llama/Llama-Guard-3-1B"


@pytest.fixture
def llamaguard_scorer(monkeypatch):
scorer = LlamaGuard(
model_name=_TINY_MODEL_NAME,
device="cpu",
)
scorer._tokenizer = AutoTokenizer.from_pretrained(_LLAMAGUARD_MODEL_NAME)

# Mock the _generate method to return predictable outputs with unsafe_score
def mock_generate(*args, **kwargs):
return "unsafe\nS10: Hate<|eot_id|>", 0.85 # Added mock unsafe_score

monkeypatch.setattr(scorer, "_generate", mock_generate)
return scorer


def test_llamaguard_postprocess(llamaguard_scorer):
# Test safe content
safe_output = ("safe", 0.1) # Added mock unsafe_score
result = llamaguard_scorer.postprocess(*safe_output)
assert result["safe"]
assert result["category"] is None
assert result["unsafe_score"] == 0.1 # Test unsafe_score

# Test unsafe content with category
unsafe_output = ("unsafe\nS5<|eot_id|>", 0.9) # Added mock unsafe_score
result = llamaguard_scorer.postprocess(*unsafe_output)
assert not result["safe"]
assert result["category"] == "S5: Defamation"
assert result["unsafe_score"] == 0.9 # Test unsafe_score


@pytest.mark.asyncio
async def test_llamaguard_score(llamaguard_scorer):
output = "Test content for scoring"
result = await llamaguard_scorer.score(output=output)
assert isinstance(result, dict)
assert "safe" in result
assert "category" in result
assert "unsafe_score" in result # Test presence of unsafe_score
assert result["safe"] is False
assert result["category"] == "S10: Hate"
assert result["unsafe_score"] == 0.85 # Test unsafe_score matches mock value


@pytest.mark.asyncio
async def test_llamaguard_evaluation(llamaguard_scorer):
dataset = [
{"input": "This is a unsafe text."},
{"input": "This is also bad text"},
]

@weave.op
def model(input: str):
return input

evaluation = weave.Evaluation(
dataset=dataset,
scorers=[llamaguard_scorer],
)
result = await evaluation.evaluate(model)

assert "LlamaGuard" in result
assert "safe" in result["LlamaGuard"]