Feature / Search: Implement page
amotl committed Mar 14, 2024
1 parent 6afd38d commit 43b0245
Showing 1 changed file with 214 additions and 0 deletions.
214 changes: 214 additions & 0 deletions docs/feature/search/index.md

# Full-Text Search

:::{include} /_include/links.md
:::
:::{include} /_include/styles.html
:::

**BM25 term search based on Apache Lucene, using SQL: CrateDB is all you need.**

:::::{grid}
:padding: 0

::::{grid-item}
:class: rubric-slim
:columns: auto 9 9 9


:::{rubric} Overview
:::
CrateDB can be used as a database to conduct full-text search operations
building upon the power of Apache Lucene.

:::{rubric} About
:::
[Full-text search] leverages the [BM25] search ranking algorithm, effectively
implementing the storage and retrieval parts of a [search engine].

In information retrieval, Okapi BM25 (BM is an abbreviation of best matching)
is a ranking function used by search engines to estimate the relevance of
documents to a given search query.
::::


::::{grid-item}
:class: rubric-slim
:columns: auto 3 3 3

```{rubric} Reference Manual
```
- [](inv:crate-reference#sql_dql_fulltext_search)
- [](inv:crate-reference#fulltext-indices)
- [](inv:crate-reference#ref-create-analyzer)

```{rubric} Related
```
- {ref}`sql`
- {ref}`vector`
- {ref}`machine-learning`
- {ref}`query`

{tags-primary}`SQL`
{tags-primary}`Full-Text Search`
{tags-primary}`Okapi BM25`
::::

:::::


:::{rubric} Details
:::
CrateDB uses Lucene as a storage layer, so it inherits the implementation
and concepts of Lucene, in the same spirit as Elasticsearch.
The now popular BM25 method has become the default scoring formula in Lucene
and is the scoring formula used by CrateDB.

BM25 stands for "Best Matching 25", the 25th iteration of this scoring algorithm.
The excellent article [BM25: The Next Generation of Lucene Relevance] compares
classic TF/IDF to [Okapi BM25], including illustrative graphs.
To learn more details about what's inside, please also refer to [Similarity in
Elasticsearch] and [BM25 vs. Lucene Default Similarity].
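
For orientation, the textbook form of the Okapi BM25 scoring function is shown
below. This is the general formula as commonly published, not a statement about
the exact variant Lucene or CrateDB evaluates internally, which may differ in
constant factors.

```{math}
\operatorname{score}(D, Q) = \sum_{i=1}^{n} \operatorname{IDF}(q_i) \cdot
\frac{f(q_i, D) \cdot (k_1 + 1)}
     {f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
```

Here, {math}`f(q_i, D)` is the frequency of query term {math}`q_i` in document
{math}`D`, {math}`|D|` is the length of the document, and {math}`\mathrm{avgdl}`
is the average document length in the collection. The parameter {math}`k_1`
controls term frequency saturation and {math}`b` controls length normalization;
Lucene's defaults are {math}`k_1 = 1.2` and {math}`b = 0.75`.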

:::{div}
While Elasticsearch uses a [query DSL based on JSON], in CrateDB, you can work
with text search using SQL.
:::


## Synopsis

Index text columns with language-specific analyzers, and query them using
full-text search with the `MATCH` predicate.

::::{grid}
:padding: 0
:class-row: title-slim

:::{grid-item} **DDL**
:columns: auto 6 6 6

```sql
CREATE TABLE documents (
  name STRING PRIMARY KEY,
  description TEXT,
  INDEX ft_english
    USING FULLTEXT(description) WITH (
      analyzer = 'english'
    ),
  INDEX ft_german
    USING FULLTEXT(description) WITH (
      analyzer = 'german'
    )
);
```
:::

:::{grid-item} **DQL**
:columns: auto 6 6 6

```sql
SELECT name
FROM documents
WHERE MATCH(ft_english, 'jump');

SELECT name
FROM documents
WHERE MATCH(ft_german, 'verwahrlost');
```
:::

::::


::::{grid}
:padding: 0
:class-row: title-slim

:::{grid-item} **DML**
:columns: auto 6 6 6

```sql
INSERT INTO documents (name, description)
VALUES
  ('Quick fox', 'The quick brown fox jumps over the lazy dog.'),
  ('Franz jagt', 'Franz jagt im komplett verwahrlosten Taxi quer durch Bayern.')
;
```
:::

:::{grid-item} **Result**
:columns: auto 6 6 6

```text
+-----------+
| name      |
+-----------+
| Quick fox |
+-----------+
SELECT 1 row in set (0.004 sec)

+------------+
| name       |
+------------+
| Franz jagt |
+------------+
SELECT 1 row in set (0.003 sec)
```
:::

::::
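
When running the statements above in sequence, two details are worth keeping
in mind. The following is a minimal sketch, assuming the `documents` table from
the DDL example: `REFRESH TABLE` makes freshly inserted rows visible to
subsequent queries, and the `_score` system column exposes the relevance score
so that results can be ordered by it.

```sql
-- Make the freshly inserted rows visible to subsequent queries.
REFRESH TABLE documents;

-- Return matches together with their relevance score, best match first.
SELECT name, _score
FROM documents
WHERE MATCH(ft_english, 'jump')
ORDER BY _score DESC;
```

The search term `jump` is expected to match the word `jumps` in the stored
text, because the `english` analyzer reduces both to the same word stem.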


## Usage

Using full-text search in CrateDB.

:::{rubric} `MATCH` predicate
:::
CrateDB's [MATCH predicate] performs a full-text search on one or more indexed
columns or indices and supports different matching techniques.

To use full-text search on a column, a [fulltext index with an
analyzer] must be created for this column.
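
The matching technique is selected with the optional `USING` clause of the
predicate. The following sketch assumes the `documents` table and the
`ft_english` index from the synopsis; the search terms are made up for
illustration, and the reference manual remains the authority on all available
match types and options.

```sql
-- Tolerate one edit (for example, a missing letter) in the search term.
SELECT name, _score
FROM documents
WHERE MATCH(ft_english, 'quik') USING best_fields WITH (fuzziness = 1)
ORDER BY _score DESC;

-- Match the terms as a phrase, in the given order.
SELECT name
FROM documents
WHERE MATCH(ft_english, 'lazy dog') USING phrase;
```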

:::{rubric} Query Language
:::
:::{todo}
Illustrate capabilities of the Lucene query language.
:::

:::{rubric} Analyzer
:::
Analyzers consist of two parts: filters and tokenizers. Each analyzer must
contain exactly one tokenizer.

Tokenizers decide how to divide the given text into tokens. Filters transform
the text by passing it through a series of operations. They come in two
kinds: character filters, which are applied to the raw text before the
tokenization step, and token filters, which are applied to the individual
tokens afterwards.

Popular filters include stopword lists, lowercase transformations, and word
stemmers.
The excellent article [Improve Your Text Search with Lucene Analyzers]
explains this topic in more detail, using Elasticsearch as an example.
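
The building blocks described above can be combined into a custom analyzer.
The following is a sketch under stated assumptions: the analyzer name
`html_english` and the table `pages` are hypothetical and only serve as an
illustration, while the tokenizer and filter names are built-in ones.

```sql
-- Hypothetical analyzer: strip HTML markup, split into words,
-- then lowercase, remove stopwords, and stem the tokens.
CREATE ANALYZER html_english (
  TOKENIZER standard,
  TOKEN_FILTERS (
    lowercase,
    stop,
    kstem
  ),
  CHAR_FILTERS (
    html_strip
  )
);

-- Hypothetical table using the custom analyzer on a full-text indexed column.
CREATE TABLE pages (
  id TEXT PRIMARY KEY,
  body TEXT INDEX USING FULLTEXT WITH (analyzer = 'html_english')
);
```

The character filter runs first on the raw text, the tokenizer then splits it
into tokens, and the token filters are applied to those tokens in the order
they are listed.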



## Learn

Learn how to set up your database for full-text search, how to create the
relevant indices, and how to query your text data efficiently. A must-read
for anyone looking to make sense of large volumes of unstructured text data.

:::{rubric} Tutorials
:::

- [](inv:cloud#full-text)
- [Custom analyzer combining multiple token filters]


:::{note}
data sets. One of its standout features is its full-text search capabilities,
built on top of the powerful Lucene library. This makes it a great fit for
organizing, searching, and analyzing extensive datasets.
:::


[BM25]: https://en.wikipedia.org/wiki/Okapi_BM25
[BM25: The Next Generation of Lucene Relevance]: https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
[BM25 vs. Lucene Default Similarity]: https://www.elastic.co/blog/found-bm-vs-lucene-default-similarity
[Custom analyzer combining multiple token filters]: https://community.cratedb.com/t/fuzzy-search-synonyms/889
[full-text search]: https://en.wikipedia.org/wiki/Full_text_search
[Improve Your Text Search with Lucene Analyzers]: https://medium.com/@dagliberkay/elastic-text-search-6b778de9b753
[MATCH predicate]: inv:crate-reference#predicates_match
[Okapi BM25]: https://trec.nist.gov/pubs/trec3/papers/city.ps.gz
[search engine]: https://en.wikipedia.org/wiki/Search_engine
[Similarity in Elasticsearch]: https://www.elastic.co/blog/found-similarity-in-elasticsearch
[TREC-3 proceedings]: https://trec.nist.gov/pubs/trec3/t3_proceedings.html
