Commit

Updated full-text search scoring and syntax (#2343)
### What problem does this PR solve?

### Type of change

- [x] Documentation Update
writinwaters authored Dec 9, 2024
1 parent 3d7a06e commit 52f38ca
Showing 10 changed files with 81 additions and 24 deletions.
2 changes: 2 additions & 0 deletions docs/getstarted/build_from_source.mdx
@@ -9,6 +9,8 @@ import TabItem from '@theme/TabItem';

Build Infinity from source, build and run unit/functional tests.

---

This document provides instructions for building Infinity from source, as well as building and running unit and functional tests.

:::tip NOTE
2 changes: 2 additions & 0 deletions docs/getstarted/deploy_infinity_server.mdx
@@ -9,6 +9,8 @@ import TabItem from '@theme/TabItem';

Three ways to deploy Infinity.

---

This document provides guidance on deploying the Infinity database. In general, you can deploy Infinity in the following three ways:

- [Import Infinity as a Python module](#import-infinity-as-a-python-module): To run Infinity locally as a Python module.
82 changes: 65 additions & 17 deletions docs/guides/search_guide.md
@@ -6,6 +6,8 @@ slug: /search_guide

Full-text, vector, sparse vector, tensor, hybrid search.

---

## Overview

This document offers guidance on conducting a search within Infinity.
@@ -71,32 +73,78 @@ Both RAG tokenization and fine-grained RAG tokenization are used in RAGFlow to e

#### IK analyzer

The IK analyzer is a bilingual tokenizer that supports Chinese (simplified and traditional) and English. It is a C++ adaptation of the [IK Analyzer](https://github.com/infinilabs/analysis-ik), widely used as a tokenizer by Chinese Elasticsearch users.

Use `"ik"` to select this analyzer, which works the same as the `ik_smart` argument in the [IK Analyzer](https://github.com/infinilabs/analysis-ik), or `"ik-fine"` for fine-grained mode, which works the same as the `ik_max_word` argument in the [IK Analyzer](https://github.com/infinilabs/analysis-ik).
The IK analyzer is a bilingual tokenizer that supports Chinese (simplified and traditional) and English. It is a C++ adaptation of the [IK Analyzer](https://github.com/infinilabs/analysis-ik), which is widely used as a tokenizer by Chinese Elasticsearch users.

Use `"ik"` to select this analyzer, which is equivalent to the `ik_smart` option in the [IK Analyzer](https://github.com/infinilabs/analysis-ik), or `"ik-fine"` for fine-grained mode, which is equivalent to the `ik_max_word` option in the [IK Analyzer](https://github.com/infinilabs/analysis-ik).

#### Keyword analyzer

The keyword analyzer is a "noop" analyzer used for columns containing keywords only, where traditional scoring methods like `BM25` do not apply. It scores `0` or `1`, depending on whether any keywords are matched.

### Search and ranking
Use `"keyword"` to select this analyzer.

### Search and ranking syntax

Infinity supports the following syntax for full-text search expressions:

- Single term
- AND multiple terms
- OR multiple terms
- Phrase search
- CARAT operator
- Sloppy phrase search
- Field-specific search
- Escape character

#### Single term

Example: `"blooms"`

#### AND multiple terms

- `"space AND efficient"`
- `"space && efficient"`
- `"space + efficient"`

#### OR multiple terms

- `"Bloom OR filter"`
- `"Bloom || filter"`
- `"Bloom filter"`

:::tip NOTE
`OR` is the default semantic in a multi-term full-text search unless explicitly specified otherwise.
:::

#### Phrase search

- `"Bloom filter"`
- `'Bloom filter'`

#### CARAT operator

Use `^` to boost the importance of a specific term. For example: `quick^2 brown` boosts the importance of `quick` by a factor of 2, making it twice as important as `brown`.

#### Sloppy phrase search

Example: `'"harmful chemical"~10'`
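
To illustrate what a slop value such as `10` permits, here is a minimal sketch of the acceptance condition stated in the comment in `phrase_doc_iterator.cpp` (part of this same commit): a candidate match is acceptable when `|phrase_pos_i - phrase_pos_j| <= slop` for every pair of terms, where `phrase_pos_i = term_pos_i - pos_i`. The function name is hypothetical, `pos_i` is assumed to be the term's offset within the query phrase, and the code checks a single candidate only; it is not Infinity's iterator.

```python
def is_acceptable(term_positions, phrase_offsets, slop):
    """Check one candidate match against the slop bound.

    term_positions[i]: position of term i in the document (term_pos_i)
    phrase_offsets[i]: assumed offset of term i in the query phrase (pos_i)
    """
    phrase_pos = [t - p for t, p in zip(term_positions, phrase_offsets)]
    # Every pairwise difference is within slop exactly when max - min is.
    return max(phrase_pos) - min(phrase_pos) <= slop


# "harmful" at document position 3, "chemical" at position 9.
print(is_acceptable([3, 9], [0, 1], slop=10))   # True
# "chemical" far away at position 30.
print(is_acceptable([3, 30], [0, 1], slop=10))  # False
```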

#### Field-specific search

Example: `"title:(quick OR brown) AND body:foobar"`

#### Escape character

Infinity offers following syntax for full-text search:
Use `\` to escape reserved characters like `:` `~` `(` `)` `"` `+` `-` `=` `&` `|` `[` `]` `{` `}` `*` `?` `\` `/`. For example: `"space\-efficient"`.
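
A minimal sketch of a helper that backslash-escapes the reserved characters listed above before a user-supplied term is placed in a query string. The name `escape_query_term` is hypothetical and not part of Infinity's SDK.

```python
# Reserved characters, copied from the list above.
RESERVED_CHARS = set(':~()"+-=&|[]{}*?\\/')


def escape_query_term(term: str) -> str:
    """Prefix every reserved character in `term` with a backslash."""
    return "".join("\\" + ch if ch in RESERVED_CHARS else ch for ch in term)


print(escape_query_term("space-efficient"))  # space\-efficient
```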

- Single term: `"blooms"`
- AND multiple terms: `"space AND efficient"`, `"space && efficient"` or `"space + efficient"`
- OR multiple terms: `"Bloom OR filter"`, `"Bloom || filter"` or just `"Bloom filter"` .
- Phrase search: `"Bloom filter" or 'Bloom filter'`
- CARAT operator: `^`: Used to boost the importance of a term, e.g., `quick^2 brown` boosts the importance of `quick` by a factor of 2, making it twice as important as `brown`.
- Sloppy phrase search: `'"harmful chemical"~10'`
- Field-specific search: `"title:(quick OR brown) AND body:foobar"`
- Escaping reserved characters: `"space\-efficient"` . `:` `~` `()` `""` `+` `-` `=` `&` `|` `[]` `{}` `*` `?` `\` `/` are reserved characters for search syntax.
### Scoring

`OR` is the default semantic among multiple terms if user does not specify in search syntax. Infinity offers `BM25` scoring and block-max `WAND` for dynamic pruning to accelerate the multiple terms search processing. There are two approaches to bypass `BM25` scoring:
Infinity offers `BM25` scoring and block-max `WAND` for dynamic pruning to accelerate multi-term searches. To *not* use `BM25` scoring, do either of the following:

- Using `keyword` analyzer when creating index, then `BM25` will not be used and it will return the score based on whether keywords are hit.
- Specifying `similarity=boolean` during searching. Then the scoring is decided by the number of keywords hits.
- Set `"analyzer"` to `"keyword"` when creating an index (to select the keyword analyzer).
*The returned score will then be based on whether keywords are matched.*
- Add `{"similarity": "boolean"}` as a search option.
*The scoring will then depend on the number of matched keywords.*
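
The difference between these two alternatives can be sketched in plain Python. This only illustrates the behavior described above (a `0` or `1` score for the keyword analyzer, a hit count for boolean similarity) and is not Infinity's implementation. Both function names are hypothetical, and counting *distinct* matched terms is an assumption.

```python
def keyword_score(column_keywords, query_keywords):
    """Keyword-analyzer style: 1 if any query keyword matches, else 0."""
    return 1 if set(column_keywords) & set(query_keywords) else 0


def boolean_score(doc_terms, query_terms):
    """Boolean-similarity style: number of distinct query terms matched."""
    return len(set(doc_terms) & set(query_terms))


doc = ["bloom", "filter", "space", "efficient"]
print(keyword_score(doc, ["bloom", "cuckoo"]))            # 1
print(boolean_score(doc, ["bloom", "filter", "cuckoo"]))  # 2
```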

## Dense vector search

@@ -149,7 +197,7 @@ Infinity offers three types of rerankers for fusion:

## Conditional filters

Conditional filters in Infinity must work through an index to facilitate search. There are two types of indexes in Infinity that support conditional filters:
Conditional filters in Infinity must work through an index to facilitate search. The following two types of indexes in Infinity support conditional filters:

- **Secondary index**: Built on numeric or string columns. This index does not apply any tokenization to a string column when using conditional filters.
- **Full-text index**: Built on full-text columns. This index applies tokenization to the full-text column but does not trigger any relevance scoring procedure.
2 changes: 2 additions & 0 deletions docs/guides/set_up_cluster.md
@@ -6,6 +6,8 @@ slug: /set_up_cluster

Architecture overview and user guide for Infinity cluster.

---

## Overview

An Infinity cluster consists of one leader node, up to four follower nodes, and several learner nodes:
3 changes: 2 additions & 1 deletion docs/references/benchmark.md
@@ -1,8 +1,9 @@
---
sidebar_position: 1
sidebar_position: 3
slug: /benchmark
---
# Benchmark

This document compares the following key specifications of Elasticsearch, Qdrant, Quickwit and Infinity:

- Time to insert & build index
4 changes: 3 additions & 1 deletion docs/references/configurations.mdx
@@ -1,5 +1,5 @@
---
sidebar_position: 5
sidebar_position: 0
slug: /configurations
---

@@ -9,6 +9,8 @@ import TabItem from '@theme/TabItem';

How to set and load a configuration file when starting Infinity.

---

This document provides instructions for loading a configuration file for Infinity and describes each configuration entry.


2 changes: 1 addition & 1 deletion docs/references/faq.md
@@ -1,5 +1,5 @@
---
sidebar_position: 2
sidebar_position: 4
slug: /FAQ
---

2 changes: 1 addition & 1 deletion docs/references/http_api_reference.mdx
@@ -1,5 +1,5 @@
---
sidebar_position: 3
sidebar_position: 1
slug: /http_api_reference
---

4 changes: 2 additions & 2 deletions docs/references/pysdk_api_reference.md
@@ -1,5 +1,5 @@
---
sidebar_position: 4
sidebar_position: 2
slug: /pysdk_api_reference
---
# Python API Reference
@@ -1571,7 +1571,7 @@ table_object.delete("c1 >= 70 and c1 <= 90")

---

### update data
### update

```python
table_object.update(cond, data)
2 changes: 1 addition & 1 deletion src/storage/invertedindex/search/phrase_doc_iterator.cpp
@@ -188,7 +188,7 @@ bool PhraseDocIterator::GetSloppyPhraseMatchData() {
term_pos_i : term i's current position in document
phrase_pos_i: term_pos_i - pos_i
For a solution (term_pos_0, term_pos_1, ..., term_pos_n), it's acceptable iff:
For a solution (term_pos_0, term_pos_1, ..., term_pos_n), it's acceptable if:
for any i, j (0<=i<=n, 0<=j<=n), |phrase_pos_i - phrase_pos_j| <= slop
For an acceptable solution, its matchLength is:
