diff --git a/docs/feature/search/index.md b/docs/feature/search/index.md
index 880df94a..518aae9b 100644
--- a/docs/feature/search/index.md
+++ b/docs/feature/search/index.md
@@ -4,11 +4,212 @@
 # Full-Text Search

+:::{include} /_include/links.md
+:::
+:::{include} /_include/styles.html
+:::
+
+**BM25 term search based on Apache Lucene, using SQL: CrateDB is all you need.**
+
+:::::{grid}
+:padding: 0
+
+::::{grid-item}
+:class: rubric-slim
+:columns: auto 9 9 9
+
+
+:::{rubric} Overview
+:::
+CrateDB can be used as a database to conduct full-text search operations
+building upon the power of Apache Lucene.
+
+:::{rubric} About
+:::
+[Full-text search] leverages the [BM25] search ranking algorithm, effectively
+implementing the storage and retrieval parts of a [search engine].
+
+In information retrieval, Okapi BM25 (BM is an abbreviation of best matching)
+is a ranking function used by search engines to estimate the relevance of
+documents to a given search query.
+::::
+
+
+::::{grid-item}
+:class: rubric-slim
+:columns: auto 3 3 3
+
+```{rubric} Reference Manual
+```
+- [](inv:crate-reference#sql_dql_fulltext_search)
+- [](inv:crate-reference#fulltext-indices)
+- [](inv:crate-reference#ref-create-analyzer)
+
+```{rubric} Related
+```
+- {ref}`sql`
+- {ref}`vector`
+- {ref}`machine-learning`
+- {ref}`query`
+
+{tags-primary}`SQL`
+{tags-primary}`Full-Text Search`
+{tags-primary}`Okapi BM25`
+::::
+
+:::::
+
+
+:::{rubric} Details
+:::
+CrateDB uses Lucene as a storage layer, so it inherits the implementation
+and concepts of Lucene, in the same spirit as Elasticsearch.
+The now popular BM25 method has become the default scoring formula in Lucene
+and is the scoring formula used by CrateDB.
+
+BM25 stands for "Best Match 25", the 25th iteration of this scoring algorithm.
+The excellent article [BM25: The Next Generation of Lucene Relevance] compares
+classic TF/IDF to [Okapi BM25], including illustrative graphs.
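+
+For orientation, the textbook form of the Okapi BM25 score of a document `D`
+for a query `Q` with terms `q_1 ... q_n` can be sketched as follows. The
+notation is the conventional one and is an assumption here, not taken from
+the CrateDB documentation: `f(q_i, D)` is the frequency of term `q_i` in `D`,
+`|D|` is the document length in terms, and `avgdl` is the average document
+length in the index.
+
+```{math}
+\operatorname{score}(D, Q) = \sum_{i=1}^{n} \operatorname{IDF}(q_i) \cdot
+  \frac{f(q_i, D) \cdot (k_1 + 1)}
+       {f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
+```
+
+The parameter `k_1` controls how quickly repeated occurrences of a term
+saturate, and `b` controls how strongly long documents are penalized.
+Lucene's `BM25Similarity` defaults to `k1 = 1.2` and `b = 0.75`. The exact
+IDF variant and constant factors differ between implementations, so treat
+this as a sketch of the idea rather than the precise formula evaluated
+by CrateDB.
+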
+To learn more about the details, please also refer to [Similarity in
+Elasticsearch] and [BM25 vs. Lucene Default Similarity].
+
+:::{div}
+While Elasticsearch uses a [query DSL based on JSON], in CrateDB, you can work
+with text search using SQL.
+:::
+
+
+## Synopsis
+
+Index text columns using language-specific analyzers, and search them using
+the `MATCH` predicate.
+
+::::{grid}
+:padding: 0
+:class-row: title-slim
+
+:::{grid-item} **DDL**
+:columns: auto 6 6 6
+
+```sql
+CREATE TABLE documents (
+  name STRING PRIMARY KEY,
+  description TEXT,
+  INDEX ft_english
+    USING FULLTEXT(description) WITH (
+      analyzer = 'english'
+    ),
+  INDEX ft_german
+    USING FULLTEXT(description) WITH (
+      analyzer = 'german'
+    )
+);
+```
+:::
+
+:::{grid-item} **DQL**
+:columns: auto 6 6 6
+
+```sql
+SELECT name
+FROM documents
+WHERE MATCH(ft_english, 'jump');
+
+SELECT name
+FROM documents
+WHERE MATCH(ft_german, 'verwahrlost');
+```
+:::
+
+::::
+
+
+::::{grid}
+:padding: 0
+:class-row: title-slim
+
+:::{grid-item} **DML**
+:columns: auto 6 6 6
+
+```sql
+INSERT INTO documents (name, description)
+VALUES
+  ('Quick fox', 'The quick brown fox jumps over the lazy dog.'),
+  ('Franz jagt', 'Franz jagt im komplett verwahrlosten Taxi quer durch Bayern.')
+;
+```
+:::
+
+:::{grid-item} **Result**
+:columns: auto 6 6 6
+
+```text
++-----------+
+| name      |
++-----------+
+| Quick fox |
++-----------+
+SELECT 1 row in set (0.004 sec)
+
++------------+
+| name       |
++------------+
+| Franz jagt |
++------------+
+SELECT 1 row in set (0.003 sec)
+```
+:::
+
+::::
+
+
+## Usage
+
+Using full-text search in CrateDB.
+
+:::{rubric} `MATCH` predicate
+:::
+CrateDB's [MATCH predicate] performs a fulltext search on one or more indexed
+columns or indices and supports different matching techniques.
+
+To run fulltext searches on a column, a [fulltext index with an
+analyzer] must first be created for this column.
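+
+The following is a minimal end-to-end sketch of these two requirements: a
+fulltext index with an analyzer, and `MATCH` queries against it. The analyzer
+name `english_folded`, the table `articles`, its index names, and the choice
+of token filters are hypothetical and only serve as illustration. The boost
+value and the `fuzziness` setting are arbitrary examples, not recommendations.
+
+```sql
+-- Hypothetical custom analyzer: one tokenizer plus a chain of token filters.
+-- The name and the filter selection are illustrative assumptions.
+CREATE ANALYZER english_folded (
+  TOKENIZER standard,
+  TOKEN_FILTERS (
+    lowercase,
+    asciifolding,
+    kstem
+  )
+);
+
+-- Hypothetical table with two fulltext indexes that use the analyzer above.
+CREATE TABLE articles (
+  title TEXT,
+  body TEXT,
+  INDEX ft_title USING FULLTEXT(title) WITH (analyzer = 'english_folded'),
+  INDEX ft_body USING FULLTEXT(body) WITH (analyzer = 'english_folded')
+);
+
+INSERT INTO articles (title, body)
+VALUES
+  ('Quick fox', 'The quick brown fox jumps over the lazy dog.');
+
+-- Make the inserted row visible to the queries below.
+REFRESH TABLE articles;
+
+-- Rank matches by relevance using the _score system column.
+SELECT title, _score
+FROM articles
+WHERE MATCH(ft_body, 'jumping fox')
+ORDER BY _score DESC;
+
+-- Search several indexes at once, boost matches in the title,
+-- and tolerate an edit distance of one per term.
+SELECT title, _score
+FROM articles
+WHERE MATCH((ft_title 2.0, ft_body), 'quik fox')
+  USING best_fields WITH (fuzziness = 1)
+ORDER BY _score DESC;
+```
+
+The `_score` values are only meaningful relative to other rows of the same
+query, so they are best used for ordering rather than as absolute numbers.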
+
+:::{rubric} Query Language
+:::
+:::{todo}
+Illustrate capabilities of the Lucene query language.
+:::
+
+:::{rubric} Analyzer
+:::
+Analyzers consist of two kinds of components: tokenizers and filters. Each
+analyzer must contain exactly one tokenizer.
+
+Tokenizers decide how to divide the given text into parts. Filters perform
+a series of transformations by passing the given text through a number of
+operations. They are divided into character filters, which are applied
+before the tokenization step, and token filters, which are applied after it.
+
+Popular filters are stopword lists, lowercase transformations, or word
+stemmers.
+The excellent article [Improve Your Text Search with Lucene Analyzers]
+covers this topic in more detail, using Elasticsearch as an example.
+
+
+
+## Learn
+
 Learn how to set up your database for full-text search, how to create the
 relevant indices, and how to query your text data efficiently. A must-read for
 anyone looking to make sense of large volumes of unstructured text data.

+:::{rubric} Tutorials
+:::
+
 - [](inv:cloud#full-text)
+- [Custom analyzer combining multiple token filters]

 :::{note}
@@ -17,3 +218,16 @@ data sets. One of its standout features are its full-text search
 capabilities, built on top of the powerful Lucene library. This makes it a
 great fit for organizing, searching, and analyzing extensive datasets.
 :::
+
+
+[BM25]: https://en.wikipedia.org/wiki/Okapi_BM25
+[BM25: The Next Generation of Lucene Relevance]: https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
+[BM25 vs. Lucene Default Similarity]: https://www.elastic.co/blog/found-bm-vs-lucene-default-similarity
+[Custom analyzer combining multiple token filters]: https://community.cratedb.com/t/fuzzy-search-synonyms/889
+[full-text search]: https://en.wikipedia.org/wiki/Full_text_search
+[Improve Your Text Search with Lucene Analyzers]: https://medium.com/@dagliberkay/elastic-text-search-6b778de9b753
+[MATCH predicate]: inv:crate-reference#predicates_match
+[Okapi BM25]: https://trec.nist.gov/pubs/trec3/papers/city.ps.gz
+[search engine]: https://en.wikipedia.org/wiki/Search_engine
+[Similarity in Elasticsearch]: https://www.elastic.co/blog/found-similarity-in-elasticsearch
+[TREC-3 proceedings]: https://trec.nist.gov/pubs/trec3/t3_proceedings.html