# [Improvement] Fuzzy matching #137
```diff
@@ -316,6 +316,7 @@
     # highlight-start
     index_filterable=True,
     index_searchable=True,
+    tokenization="word",
     # highlight-end
 ),
 Property(
@@ -324,6 +325,7 @@
     # highlight-start
     index_filterable=True,
     index_searchable=True,
+    tokenization="field",
     # highlight-end
 ),
 Property(
```

> **Review comment** (on `tokenization="field"`): I think this should be:
>
> ```python
> from weaviate.classes.config import Tokenization
>
> Property(
>     # ...
>     tokenization=Tokenization.FIELD
> )
> ```
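The diff above sets `tokenization="word"` on one property and `"field"` on another. Roughly, `word` lowercases the text and splits it on non-alphanumeric characters, while `field` trims whitespace and keeps the whole value as a single token. The sketch below illustrates that documented behavior in plain Python; it is not Weaviate's actual tokenizer, and details such as whether `field` lowercases are per the tokenization reference, not this sketch.

```python
import re

def tokenize_word(text: str) -> list[str]:
    # "word"-style tokenization: lowercase, then split on non-alphanumeric characters
    return [t for t in re.split(r"[^0-9a-zA-Z]+", text.lower()) if t]

def tokenize_field(text: str) -> list[str]:
    # "field"-style tokenization: trim whitespace, keep the entire value as one token
    return [text.strip().lower()]

print(tokenize_word("Hello, (beautiful) world"))   # ['hello', 'beautiful', 'world']
print(tokenize_field("Hello, (beautiful) world"))  # ['hello, (beautiful) world']
```

With `field`, only an exact match on the whole value hits the token, which is why the reviewer's precision note below recommends it for identifier-like properties.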
```diff
@@ -351,6 +351,12 @@ The `gse` tokenizer is not loaded by default to save resources. To use it, set t
 - `"素早い茶色の狐が怠けた犬を飛び越えた"`: `["素早", "素早い", "早い", "茶色", "の", "狐", "が", "怠け", "けた", "犬", "を", "飛び", "飛び越え", "越え", "た", "素早い茶色の狐が怠けた犬を飛び越えた"]`
 - `"すばやいちゃいろのきつねがなまけたいぬをとびこえた"`: `["すばや", "すばやい", "やい", "いち", "ちゃ", "ちゃい", "ちゃいろ", "いろ", "のき", "きつ", "きつね", "つね", "ねが", "がな", "なま", "なまけ", "まけ", "けた", "けたい", "たい", "いぬ", "を", "とび", "とびこえ", "こえ", "た", "すばやいちゃいろのきつねがなまけたいぬをとびこえた"]`

+:::note `trigram` for fuzzy matching
+
+While originally designed for Asian languages, `trigram` tokenization is also highly effective for fuzzy matching and typo tolerance in other languages.
+
+:::
+
 </details>

 <details>
```
```diff
@@ -405,6 +411,42 @@ You can limit the combined number of `gse` and `Kagome` tokenizers running at th

 </details>

+<details>
+<summary>Fuzzy matching with `trigram` tokenization</summary>
+
+The `trigram` tokenization method provides fuzzy matching capabilities by breaking text into overlapping 3-character sequences. This enables BM25 searches to find matches even with spelling errors or variations.
+
+**Use cases for trigram fuzzy matching:**
+
+- **Typo tolerance**: Find matches despite spelling errors (e.g., "Reliace" matches "Reliance")
+- **Name reconciliation**: Match entity names with variations across datasets
+- **Search-as-you-type**: Build autocomplete functionality
+- **Partial matching**: Find objects with partial string matches
+
+**How it works:**
+
+When text is tokenized with `trigram`, it is broken into all possible 3-character sequences:
+
+- `"hello"` → `["hel", "ell", "llo"]`
+- `"world"` → `["wor", "orl", "rld"]`
+
+Similar strings share many trigrams, enabling fuzzy matching:
+
+- `"Morgan Stanley"` and `"Stanley Morgn"` share trigrams like `"sta", "tan", "anl", "nle", "ley"`
```
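The breakdown described in this hunk can be reproduced in a few lines of Python. This is an illustrative sketch only; Weaviate's trigram tokenizer may additionally strip punctuation or whitespace.

```python
def trigrams(text: str) -> list[str]:
    """Break text into overlapping 3-character sequences."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]

print(trigrams("hello"))  # ['hel', 'ell', 'llo']
print(trigrams("world"))  # ['wor', 'orl', 'rld']

# Similar strings share many trigrams, which is what makes fuzzy matching work:
shared = set(trigrams("Morgan Stanley")) & set(trigrams("Stanley Morgn"))
print(sorted(shared))
```

Strings that differ by a typo still overlap on most of their trigrams, so a BM25 search over trigram tokens scores the near-match highly.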
```diff
+**Performance considerations:**
+
+- Creates larger inverted indexes due to more tokens
+- May impact query performance for large datasets
```

> **Review comment:** Filtering behavior will change significantly, as text filtering will be done based on trigram-tokenized text, instead of whole words.

```diff
+:::tip
+
+Use trigram tokenization selectively on fields where fuzzy matching is preferred. Keep exact-match fields with `word` or `field` tokenization for precision.
+
+:::
+
+</details>
+
+---
+
 ### Inverted index {#inverted-index}
```
```diff
@@ -1,5 +1,6 @@
 ---
 title: Keyword search
+description: Weaviate BM25 keyword search documentation covering basic queries, search operators, scoring, property targeting, weighting, tokenization, filtering and fuzzy matching.
 sidebar_position: 40
 image: og/docs/howto.jpg
 # tags: ['how to', 'similarity search']
@@ -8,13 +9,13 @@ image: og/docs/howto.jpg
 import Tabs from '@theme/Tabs';
 import TabItem from '@theme/TabItem';
 import FilteredTextBlock from '@site/src/components/Documentation/FilteredTextBlock';
-import PyCode from '!!raw-loader!/_includes/code/howto/search.bm25.py';
-import PyCodeV3 from '!!raw-loader!/_includes/code/howto/search.bm25-v3.py';
-import TSCode from '!!raw-loader!/_includes/code/howto/search.bm25.ts';
-import TSCodeLegacy from '!!raw-loader!/_includes/code/howto/search.bm25-v2.ts';
-import GoCode from '!!raw-loader!/_includes/code/howto/go/docs/mainpkg/search-bm25_test.go';
-import JavaCode from '!!raw-loader!/_includes/code/howto/java/src/test/java/io/weaviate/docs/search/KeywordSearchTest.java';
-import GQLCode from '!!raw-loader!/_includes/code/howto/search.bm25.gql.py';
+import PyCode from '!!raw-loader!/\_includes/code/howto/search.bm25.py';
+import PyCodeV3 from '!!raw-loader!/\_includes/code/howto/search.bm25-v3.py';
+import TSCode from '!!raw-loader!/\_includes/code/howto/search.bm25.ts';
+import TSCodeLegacy from '!!raw-loader!/\_includes/code/howto/search.bm25-v2.ts';
+import GoCode from '!!raw-loader!/\_includes/code/howto/go/docs/mainpkg/search-bm25_test.go';
+import JavaCode from '!!raw-loader!/\_includes/code/howto/java/src/test/java/io/weaviate/docs/search/KeywordSearchTest.java';
+import GQLCode from '!!raw-loader!/\_includes/code/howto/search.bm25.gql.py';

 `Keyword` search, also called "BM25 (Best match 25)" or "sparse vector" search, returns objects that have the highest BM25F scores.
```
|
@@ -239,9 +240,6 @@ The response is like this: | |
|
||
## Search on selected properties only | ||
|
||
:::info Added in `v1.19.0` | ||
::: | ||
|
||
A keyword search can be directed to only search a subset of object properties. In this example, the BM25 search only uses the `question` property to produce the BM25F score. | ||
|
||
<Tabs groupId="languages"> | ||
|
@@ -320,13 +318,11 @@ The response is like this: | |
endMarker="# END Expected BM25WithProperties results" | ||
language="json" | ||
/> | ||
|
||
</details> | ||
|
||
## Use weights to boost properties | ||
|
||
:::info Added in `v1.19.0` | ||
::: | ||
|
||
You can weight how much each property affects the overall BM25F score. This example boosts the `question` property by a factor of 2 while the `answer` property remains static. | ||
|
||
<Tabs groupId="languages"> | ||
|
@@ -384,7 +380,6 @@ You can weight how much each property affects the overall BM25F score. This exam | |
/> | ||
</TabItem> | ||
|
||
|
||
<TabItem value="graphql" label="GraphQL"> | ||
<FilteredTextBlock | ||
text={PyCodeV3} | ||
|
@@ -409,18 +404,16 @@ The response is like this: | |
|
||
</details> | ||
|
||
|
||
## Set tokenization | ||
|
||
The BM25 query string is [tokenized](../config-refs/collections.mdx#tokenization) before it is used to search for objects using the inverted index. | ||
|
||
You must specify the tokenization method in the collection definition for [each property](../manage-collections/vector-config.mdx#property-level-settings). | ||
|
||
import TknPyCode from '!!raw-loader!/_includes/code/howto/manage-data.collections.py'; | ||
import TknPyCodeV3 from '!!raw-loader!/_includes/code/howto/manage-data.collections-v3.py'; | ||
import TknTsCode from '!!raw-loader!/_includes/code/howto/manage-data.collections.ts'; | ||
import TknTsCodeLegacy from '!!raw-loader!/_includes/code/howto/manage-data.collections-v2.ts'; | ||
|
||
import TknPyCode from '!!raw-loader!/\_includes/code/howto/manage-data.collections.py'; | ||
import TknPyCodeV3 from '!!raw-loader!/\_includes/code/howto/manage-data.collections-v3.py'; | ||
import TknTsCode from '!!raw-loader!/\_includes/code/howto/manage-data.collections.ts'; | ||
import TknTsCodeLegacy from '!!raw-loader!/\_includes/code/howto/manage-data.collections-v2.ts'; | ||
|
||
<Tabs groupId="languages"> | ||
<TabItem value="py" label="Python Client v4"> | ||
|
@@ -469,6 +462,12 @@ import TknTsCodeLegacy from '!!raw-loader!/_includes/code/howto/manage-data.coll | |
</TabItem> | ||
</Tabs> | ||
|
||
:::tip Tokenization and fuzzy matching | ||
|
||
For fuzzy matching and typo tolerance, use `trigram` tokenization. See the [fuzzy matching section](#fuzzy-matching) above for details. | ||
|
||
::: | ||
|
||
## `limit` & `offset` | ||
|
||
Use `limit` to set a fixed maximum number of objects to return. | ||
|
```diff
@@ -747,18 +746,56 @@ The response is like this:

 ### Tokenization

-import TokenizationNote from '/_includes/tokenization.mdx'
+import TokenizationNote from '/\_includes/tokenization.mdx'

 <TokenizationNote />

-## Related pages
+## Fuzzy matching
+
+You can enable fuzzy matching and typo tolerance in BM25 searches by using [`trigram` tokenization](../config-refs/collections.mdx#tokenization). This technique breaks text into overlapping 3-character sequences, allowing BM25 to find matches even when there are spelling errors or variations.
+
+This enables matching between similar but not identical strings because they share trigrams:
+
+- `"Morgn"` and `"Morgan"` share the trigrams `"mor"` and `"org"`
+
+Set the tokenization method to `trigram` at the property level when creating your collection:
+
+<Tabs groupId="languages">
+  <TabItem value="py" label="Python Client v4">
+    <FilteredTextBlock
+      text={TknPyCode}
+      startMarker="# START TrigramTokenization"
+      endMarker="# END TrigramTokenization"
+      language="py"
+    />
+  </TabItem>
+  <TabItem value="js" label="JS/TS Client v3">
+    <FilteredTextBlock
+      text={TknTsCode}
+      startMarker="// START TrigramTokenization"
+      endMarker="// END TrigramTokenization"
+      language="ts"
+    />
+  </TabItem>
+</Tabs>
+
+:::tip Best practices
+
+- Use trigram tokenization selectively on fields that need fuzzy matching.
+- Keep exact-match fields with `word` or `field` tokenization for precision.
+
+:::
```

> **Review comment:** Note trigram tokenization will impact filtering behavior, as token comparisons will be based on trigrams, rather than words.

```diff
+
+## Further resources

-- [Connect to Weaviate](/weaviate/connections/index.mdx)
+- [Connect to Weaviate](../connections/index.mdx)
 - [API References: Search operators # BM25](../api/graphql/search-operators.md#bm25)
 - [Reference: Tokenization options](../config-refs/collections.mdx#tokenization)
 - [Weaviate Academy: Tokenization](../../academy/py/tokenization/index.md)

 ## Questions and feedback

-import DocsFeedback from '/_includes/docs-feedback.mdx';
+import DocsFeedback from '/\_includes/docs-feedback.mdx';

 <DocsFeedback/>
```
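The typo tolerance that the new "Fuzzy matching" section describes can be approximated outside Weaviate by ranking candidates on trigram overlap (Jaccard similarity). This is a sketch of the principle only; BM25 over trigram tokens weighs matches differently (document frequency, length normalization), so scores will not be identical.

```python
def trigrams(text: str) -> set[str]:
    """Collect the overlapping 3-character sequences of a string."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}

def trigram_similarity(a: str, b: str) -> float:
    """Jaccard similarity over trigram sets (0.0 = disjoint, 1.0 = identical)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

candidates = ["Reliance", "Alliance", "Radiance"]
query = "Reliace"  # typo for "Reliance"

# The misspelling still shares most of its trigrams with the intended word:
best = max(candidates, key=lambda c: trigram_similarity(query, c))
print(best)  # Reliance
```

The misspelled query shares three trigrams (`rel`, `eli`, `lia`) with "Reliance" but at most one with the other candidates, so the intended word ranks first despite the typo.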
> **Review comment** (continuing the `Tokenization` suggestion above): I think this should be the enum (unless it accepts raw strings also? I guess maybe it does, since it's just an enum). The function signature is this:
>
> ```python
> tokenization: Optional[Tokenization] = Field(default=None)
> ```
>
> so even if the string works I wonder if the enum is preferable.
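Whether a raw string is accepted comes down to how the enum is defined and validated. The stand-in below is hypothetical (it is not the actual client source, and `Tokenization`'s real definition may differ): if the enum mixes in `str`, raw strings compare equal to members and convert cleanly, while the enum itself remains the more typo-proof choice.

```python
from enum import Enum

class Tokenization(str, Enum):
    """Hypothetical stand-in for weaviate.classes.config.Tokenization."""
    WORD = "word"
    FIELD = "field"
    TRIGRAM = "trigram"

# A str-valued enum member compares equal to its raw string value...
assert Tokenization.FIELD == "field"

# ...and the raw string converts to the member, which is how validation
# layers typically coerce string input into the enum:
assert Tokenization("field") is Tokenization.FIELD

# A typo in a raw string only fails at runtime, whereas a typo in an enum
# attribute fails immediately (AttributeError) and is caught by type checkers:
try:
    Tokenization("feild")
except ValueError as err:
    print(f"rejected: {err}")
```

Under this assumption both spellings work, which supports preferring the enum for editor support and early error detection rather than for correctness.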