[Improvement] Fuzzy matching #137
base: main
Orca Security Scan Summary
Checks run: Infrastructure as Code, SAST, Secrets, Vulnerabilities.
Great to see you again! Thanks for the contribution.
@@ -316,6 +316,7 @@
# highlight-start
index_filterable=True,
index_searchable=True,
tokenization="word",
I think this should be:
from weaviate.classes.config import Property, Tokenization
Property(
    # ...
    tokenization=Tokenization.WORD,
)
(Unless it also accepts raw strings? It may, since it's just an enum.)
The signature is `tokenization: Optional[Tokenization] = Field(default=None)`, so even if the string works, I wonder whether the enum is preferable.
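To illustrate the question above, here is a minimal sketch of how a string-backed enum behaves in plain Python. The `Tokenization` class below is a hypothetical stand-in written for this example, not the client's actual class, and `resolve_tokenization` is an invented helper. Whether the real `Property` model coerces raw strings the same way is an assumption that would need checking against the client.

```python
from enum import Enum
from typing import Optional, Union


class Tokenization(str, Enum):
    """Hypothetical stand-in for the client's enum (assumed to be string-backed)."""

    WORD = "word"
    FIELD = "field"
    TRIGRAM = "trigram"


def resolve_tokenization(
    value: Optional[Union[str, Tokenization]],
) -> Optional[Tokenization]:
    """Coerce a raw string or an enum member to an enum member.

    Hypothetical helper: unknown names raise ValueError, which is the
    main practical argument for passing the enum in the first place.
    """
    if value is None:
        return None
    return Tokenization(value)


# A raw string and the enum member resolve to the same object.
print(resolve_tokenization("word") is Tokenization.WORD)  # True
# A string-backed member also compares equal to its raw value.
print(Tokenization.FIELD == "field")  # True
```

With the enum, a typo such as `Tokenization.WROD` fails immediately as an `AttributeError` and is visible to linters, whereas a misspelled string is only caught when the value is validated.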
@@ -324,6 +325,7 @@
# highlight-start
index_filterable=True,
index_searchable=True,
tokenization="field",
I think this should be:
from weaviate.classes.config import Property, Tokenization
Property(
    # ...
    tokenization=Tokenization.FIELD,
)
- Creates larger inverted indexes due to more tokens
- May impact query performance for large datasets
Filtering behavior will change significantly, as text filtering will be done based on trigram-tokenized text, instead of whole words.
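As a rough illustration of why filtering changes, the sketch below compares word tokens with rolling trigrams for a correctly spelled value and a misspelled query. Both tokenizers are simplified assumptions written for this example (lowercase, drop non-alphanumeric characters); they are not Weaviate's implementation, and the function names are hypothetical.

```python
def word_tokens(text: str) -> set[str]:
    """Simplified word tokenizer: lowercase, split on non-alphanumerics."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return set(cleaned.split())


def trigram_tokens(text: str) -> set[str]:
    """Simplified trigram tokenizer: lowercase, keep alphanumerics,
    then take every rolling window of three characters."""
    cleaned = "".join(ch.lower() for ch in text if ch.isalnum())
    return {cleaned[i:i + 3] for i in range(len(cleaned) - 2)}


stored = "weaviate"
query = "weavaite"  # two letters transposed

# Word tokens: the misspelled query shares nothing with the stored value.
print(sorted(word_tokens(stored) & word_tokens(query)))  # []

# Trigrams: the two strings still overlap on their common prefix.
print(sorted(trigram_tokens(stored) & trigram_tokens(query)))  # ['eav', 'wea']
```

The same overlap that makes fuzzy search possible means a filter compares trigram tokens rather than whole words, so its matches can differ from those on a `word`-tokenized property.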
- Use trigram tokenization selectively on fields that need fuzzy matching.
- Keep exact-match fields with `word` or `field` tokenization for precision.
Note that trigram tokenization will also affect filtering behavior, since token comparisons will be based on trigrams rather than words.
What's being changed:
Add docs about using trigram tokenization for fuzzy string searches.
Type of change:
How has this been tested?
yarn start