[Improvement] Fuzzy matching #137

Open
wants to merge 4 commits into base: main

Conversation

g-despot
Contributor

What's being changed:

Add docs about using trigram tokenization for fuzzy string searches.

Type of change:

  • Documentation content updates (non-breaking change to fix/update documentation)

How has this been tested?

  • Local build - the site works as expected when running `yarn start`

@orca-security-eu orca-security-eu bot left a comment

Orca Security Scan Summary

| Status | Check | Issues by priority |
|--------|-------|--------------------|
| Passed | Infrastructure as Code | high 0, medium 0, low 0, info 0 |
| Passed | SAST | high 0, medium 0, low 0, info 0 |
| Passed | Secrets | high 0, medium 0, low 0, info 0 |
| Passed | Vulnerabilities | high 0, medium 0, low 0, info 0 |

@weaviate-git-bot

Great to see you again! Thanks for the contribution.

beep boop - the Weaviate bot 👋🤖

PS:
Are you already a member of the Weaviate Slack channel?

@g-despot g-despot requested a review from databyjp August 18, 2025 07:43
@@ -316,6 +316,7 @@
# highlight-start
index_filterable=True,
index_searchable=True,
tokenization="word",
Contributor

@databyjp databyjp Aug 18, 2025

I think this should be:

from weaviate.classes.config import Tokenization

Property(
    # ...
    tokenization=Tokenization.WORD
)

(Unless it also accepts raw strings? I guess it might, since it's just an enum.)

The function signature is `tokenization: Optional[Tokenization] = Field(default=None)`, so even if the string works, I wonder if the enum is preferable.
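
To make the string-versus-enum question concrete, here is a minimal sketch. It uses a hypothetical stand-in `Tokenization` class and a hypothetical `coerce_tokenization` helper, and it assumes the real enum is string-valued, which this thread does not confirm. It is not the client's implementation, only an illustration of how a string-valued enum can accept both forms.

```python
from enum import Enum
from typing import Optional, Union


class Tokenization(str, Enum):
    """Hypothetical stand-in; NOT the class from weaviate.classes.config."""

    WORD = "word"
    FIELD = "field"
    TRIGRAM = "trigram"


def coerce_tokenization(
    value: Union[str, Tokenization, None],
) -> Optional[Tokenization]:
    """Accept an enum member or its string value (hypothetical helper)."""
    if value is None:
        return None
    # Enum lookup by value: returns the member for a known string or member,
    # and raises ValueError for anything else (for example a typo).
    return Tokenization(value)
```

Under these assumptions both `"word"` and `Tokenization.WORD` resolve to the same member, but a misspelled string such as `"wrod"` only fails at runtime, while a misspelled enum member is caught earlier by an editor or type checker. That is one argument for showing the enum in the docs.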

@@ -324,6 +325,7 @@
# highlight-start
index_filterable=True,
index_searchable=True,
tokenization="field",
Contributor

I think this should be:

from weaviate.classes.config import Tokenization

Property(
    # ...
    tokenization=Tokenization.FIELD
)


- Creates larger inverted indexes due to more tokens
- May impact query performance for large datasets

Contributor

Filtering behavior will change significantly, as text filtering will be done based on trigram-tokenized text, instead of whole words.
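
A rough illustration of why filtering changes, as a simplified model rather than the database's actual tokenizer: the helpers `word_tokens`, `trigram_tokens`, and `matches` are hypothetical, and the model assumes that text is lowercased, non-alphanumeric characters are dropped, and a filter matches when every query token is present in the stored value.

```python
import re


def word_tokens(text: str) -> list:
    """Split on non-alphanumeric characters and lowercase (simplified)."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def trigram_tokens(text: str) -> list:
    """Rolling 3-character windows over the lowercased alphanumerics."""
    s = re.sub(r"[^0-9a-z]+", "", text.lower())
    return [s[i:i + 3] for i in range(len(s) - 2)]


def matches(filter_value: str, stored: str, tokenize) -> bool:
    """Assumed filter model: every query token must occur in the stored text."""
    stored_tokens = set(tokenize(stored))
    query_tokens = tokenize(filter_value)
    return bool(query_tokens) and all(t in stored_tokens for t in query_tokens)


stored = "concatenate strings"
print(len(word_tokens(stored)), len(trigram_tokens(stored)))
print(matches("cat", stored, word_tokens))
print(matches("cat", stored, trigram_tokens))
```

In this model the stored value yields 2 word tokens but 16 trigrams, and a filter on `cat` does not match under word tokenization but does match under trigram tokenization, because `cat` appears as a trigram inside `concatenate`.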


- Use trigram tokenization selectively on fields that need fuzzy matching.
- Keep exact-match fields with `word` or `field` tokenization for precision.

Contributor

Note that trigram tokenization will impact filtering behavior, because token comparisons will be based on trigrams rather than words.
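
The trade-off between fuzzy and exact-match fields can be sketched with a toy overlap score. This uses Jaccard similarity over token sets purely for illustration; it is a swapped-in measure, not the ranking the database applies, and `trigram_set`, `word_set`, and `jaccard` are hypothetical helpers.

```python
import re


def trigram_set(text: str) -> set:
    """Set of rolling 3-character windows over lowercased alphanumerics."""
    s = re.sub(r"[^0-9a-z]+", "", text.lower())
    return {s[i:i + 3] for i in range(len(s) - 2)}


def word_set(text: str) -> set:
    """Set of lowercased word tokens (simplified word tokenization)."""
    return {t for t in re.split(r"[^0-9a-z]+", text.lower()) if t}


def jaccard(a: set, b: set) -> float:
    """Overlap of two token sets: |a & b| / |a | b|, or 0.0 if both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# A misspelled query still shares trigrams with the stored value,
# but shares no whole-word token with it.
print(jaccard(trigram_set("weaviate"), trigram_set("weaviaet")))
print(jaccard(word_set("weaviate"), word_set("weaviaet")))
```

Here the misspelled query keeps a nonzero overlap on trigrams and zero overlap on whole words. The same tolerance that helps fuzzy search is what makes trigram comparisons too loose for fields that need exact matches.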

3 participants