[Improvement] Fuzzy matching #137

g-despot · 2025-08-14T10:41:43Z

What's being changed:

Add docs about using trigram tokenization for fuzzy string searches.

Type of change:

Documentation content updates (non-breaking change to fix/update documentation)

How has this been tested?

Local build - the site works as expected when running `yarn start`

orca-security-eu

Orca Security Scan Summary

Status	Check	Issues by priority
Passed	Infrastructure as Code	0 0 0 0
Passed	SAST	0 0 0 0
Passed	Secrets	0 0 0 0
Passed	Vulnerabilities	0 0 0 0

weaviate-git-bot · 2025-08-15T10:04:55Z

Great to see you again! Thanks for the contribution.

beep boop - the Weaviate bot 👋🤖

PS:
Are you already a member of the Weaviate Slack channel?

databyjp · 2025-08-18T19:27:41Z

_includes/code/config-refs/reference.collections.py

@@ -316,6 +316,7 @@
            # highlight-start
            index_filterable=True,
            index_searchable=True,
+            tokenization="word",


I think this should be:

    from weaviate.classes.config import Tokenization

    Property(
        # ...
        tokenization=Tokenization.WORD
    )

(Unless it also accepts raw strings? I guess it might, since it's just an enum.)

The function signature is `tokenization: Optional[Tokenization] = Field(default=None)`, so even if the string works, I wonder if the enum is preferable.
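To make the enum-versus-string question concrete, here is a minimal standard-library sketch. The `Tokenization` class below is a hypothetical stand-in, not the real `weaviate.classes.config.Tokenization`, and `resolve_tokenization` is an illustrative helper. It assumes a string-valued enum that is validated by value lookup; whether the actual client coerces raw strings this way is not confirmed in this thread.

```python
from enum import Enum
from typing import Optional, Union


class Tokenization(str, Enum):
    """Hypothetical stand-in for the client's enum (assumption, not the real class)."""

    WORD = "word"
    FIELD = "field"
    TRIGRAM = "trigram"


def resolve_tokenization(
    value: Union[str, Tokenization, None],
) -> Optional[Tokenization]:
    """Coerce a raw string or enum member to an enum member by value lookup."""
    if value is None:
        return None
    # Enum lookup by value: raises ValueError for unknown strings such as "wrod".
    return Tokenization(value)


from_string = resolve_tokenization("word")
from_enum = resolve_tokenization(Tokenization.WORD)

try:
    resolve_tokenization("wrod")
    typo_rejected = False
except ValueError:
    typo_rejected = True
```

Under these assumptions both spellings resolve to the same member, but a typo in a raw string only surfaces when the code runs, while `Tokenization.WROD` would already be flagged by an editor or type checker. That is one argument for preferring the enum in docs snippets.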

databyjp · 2025-08-18T19:27:53Z

_includes/code/config-refs/reference.collections.py

@@ -324,6 +325,7 @@
            # highlight-start
            index_filterable=True,
            index_searchable=True,
+            tokenization="field",


I think this should be:

    from weaviate.classes.config import Tokenization

    Property(
        # ...
        tokenization=Tokenization.FIELD
    )

databyjp · 2025-08-18T19:33:10Z

docs/weaviate/config-refs/collections.mdx

+
+- Creates larger inverted indexes due to more tokens
+- May impact query performance for large datasets
+


Filtering behavior will change significantly, as text filtering will be done based on trigram-tokenized text, instead of whole words.
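A rough sketch of why the token count grows and why filters behave differently. The normalization used here (lowercasing and dropping non-alphanumeric characters before taking rolling three-character windows) is an assumption made for illustration; Weaviate's actual trigram tokenizer may normalize text differently. The helper names are hypothetical.

```python
import re


def word_tokens(text: str) -> list[str]:
    """Split on non-alphanumeric characters and lowercase (simplified word tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def trigram_tokens(text: str) -> list[str]:
    """Rolling 3-character windows over the normalized text (simplified, assumed behavior)."""
    normalized = re.sub(r"[^a-z0-9]", "", text.lower())
    return [normalized[i : i + 3] for i in range(len(normalized) - 2)]


text = "Morning coffee"
words = word_tokens(text)        # 2 tokens to index
trigrams = trigram_tokens(text)  # 11 tokens to index

# A filter value is compared against whichever tokens were indexed.
matches_as_word = "cof" in words
matches_as_trigram = "cof" in trigrams
```

In this sketch the same two-word value produces 11 index entries instead of 2, and a fragment such as `cof` matches under trigram tokenization but not under word tokenization.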

databyjp · 2025-08-18T19:40:00Z

docs/weaviate/search/bm25.md

+
+- Use trigram tokenization selectively on fields that need fuzzy matching.
+- Keep exact-match fields with `word` or `field` tokenization for precision.
+


Note that trigram tokenization will also affect filtering behavior, because token comparisons will be based on trigrams rather than words.
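The following sketch shows what "token comparisons based on trigrams" can mean for a misspelled query. It uses a simplified trigram split (lowercase, alphanumerics only), which is an assumption rather than Weaviate's documented behavior, and the function name is hypothetical.

```python
import re


def trigram_set(text: str) -> set[str]:
    """Set of rolling 3-character windows over normalized text (simplified assumption)."""
    normalized = re.sub(r"[^a-z0-9]", "", text.lower())
    return {normalized[i : i + 3] for i in range(len(normalized) - 2)}


stored_value = "weaviate"
misspelled_query = "weavaite"

# Word-level comparison: the tokens differ, so there is no match at all.
word_match = stored_value == misspelled_query

# Trigram-level comparison: some tokens still overlap despite the typo.
shared = trigram_set(stored_value) & trigram_set(misspelled_query)
```

Here the word comparison fails outright, while the two strings still share the trigrams `wea` and `eav`. The same overlap that enables fuzzy search also means a filter on a trigram-tokenized field can match values that a word-tokenized field would exclude.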

g-despot added 2 commits August 14, 2025 09:57

Update docs

1030d4e

Add fuzzy mathing docs

399e099

orca-security-eu bot reviewed Aug 14, 2025

Update docs

2e585d2

Update docs

f575f9e

g-despot requested a review from databyjp August 18, 2025 07:43

databyjp reviewed Aug 18, 2025
