Feature / Search: Implement page

crate · Mar 14, 2024 · 43b0245
1 parent 6afd38d
commit 43b0245
Showing 1 changed file with 214 additions and 0 deletions.
diff --git a/docs/feature/search/index.md b/docs/feature/search/index.md
@@ -4,11 +4,212 @@
 
 # Full-Text Search
 
+:::{include} /_include/links.md
+:::
+:::{include} /_include/styles.html
+:::
+
+**BM25 term search based on Apache Lucene, using SQL: CrateDB is all you need.**
+
+:::::{grid}
+:padding: 0
+
+::::{grid-item}
+:class: rubric-slim
+:columns: auto 9 9 9
+
+
+:::{rubric} Overview
+:::
+CrateDB supports full-text search operations, building upon the power of
+Apache Lucene.
+
+:::{rubric} About
+:::
+[Full-text search] leverages the [BM25] search ranking algorithm, effectively
+implementing the storage and retrieval parts of a [search engine].
+
+In information retrieval, Okapi BM25 (BM is an abbreviation of best matching)
+is a ranking function used by search engines to estimate the relevance of
+documents to a given search query.
+::::
+
+
+::::{grid-item}
+:class: rubric-slim
+:columns: auto 3 3 3
+
+```{rubric} Reference Manual
+```
+- [](inv:crate-reference#sql_dql_fulltext_search)
+- [](inv:crate-reference#fulltext-indices)
+- [](inv:crate-reference#ref-create-analyzer)
+
+```{rubric} Related
+```
+- {ref}`sql`
+- {ref}`vector`
+- {ref}`machine-learning`
+- {ref}`query`
+
+{tags-primary}`SQL`
+{tags-primary}`Full-Text Search`
+{tags-primary}`Okapi BM25`
+::::
+
+:::::
+
+
+:::{rubric} Details
+:::
+CrateDB uses Lucene as a storage layer, so it inherits the implementation
+and concepts of Lucene, in the same spirit as Elasticsearch.
+The now popular BM25 method has become the default scoring formula in Lucene
+and is the scoring formula used by CrateDB.
+
+BM25 stands for "Best Matching 25", the 25th iteration of this scoring algorithm.
+The excellent article [BM25: The Next Generation of Lucene Relevance] compares
+classic TF/IDF to [Okapi BM25], including illustrative graphs.
+To learn more details about what's inside, please also refer to [Similarity in
+Elasticsearch] and [BM25 vs. Lucene Default Similarity].
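+
+For orientation, the textbook form of the Okapi BM25 scoring function for a
+document {math}`D` and a query {math}`Q` with terms {math}`q_1, \dots, q_n`
+is shown below. Here, {math}`f(q_i, D)` is the frequency of term {math}`q_i`
+in {math}`D`, {math}`|D|` is the length of the document in words,
+{math}`\mathrm{avgdl}` is the average document length in the collection, and
+{math}`k_1` and {math}`b` are free parameters. Lucene's `BM25Similarity`
+defaults to {math}`k_1 = 1.2` and {math}`b = 0.75`. That a given CrateDB
+version uses exactly this variant and these constants is an assumption here,
+not something this page states.
+
+```{math}
+\operatorname{score}(D, Q) = \sum_{i=1}^{n} \operatorname{IDF}(q_i) \cdot
+  \frac{f(q_i, D) \cdot (k_1 + 1)}
+       {f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
+```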
+
+:::{div}
+While Elasticsearch uses a [query DSL based on JSON], in CrateDB, you can work
+with text search using SQL.
+:::
+
+
+## Synopsis
+
+Store text documents, index them using language-specific analyzers, and
+query them using the `MATCH` predicate.
+
+::::{grid}
+:padding: 0
+:class-row: title-slim
+
+:::{grid-item} **DDL**
+:columns: auto 6 6 6
+
+```sql
+CREATE TABLE documents (
+  name STRING PRIMARY KEY,
+  description TEXT,
+  INDEX ft_english
+    USING FULLTEXT(description) WITH (
+      analyzer = 'english'
+    ),
+  INDEX ft_german
+    USING FULLTEXT(description) WITH (
+      analyzer = 'german'
+    )
+);
+```
+:::
+
+:::{grid-item} **DQL**
+:columns: auto 6 6 6
+
+```sql
+SELECT name
+FROM documents
+WHERE MATCH(ft_english, 'jump');
+
+SELECT name
+FROM documents
+WHERE MATCH(ft_german, 'verwahrlost');
+```
+:::
+
+::::
+
+
+::::{grid}
+:padding: 0
+:class-row: title-slim
+
+:::{grid-item} **DML**
+:columns: auto 6 6 6
+
+```sql
+INSERT INTO documents (name, description)
+VALUES
+  ('Quick fox', 'The quick brown fox jumps over the lazy dog.'),
+  ('Franz jagt', 'Franz jagt im komplett verwahrlosten Taxi quer durch Bayern.')
+;
+```
+:::
+
+:::{grid-item} **Result**
+:columns: auto 6 6 6
+
+```text
++-----------+
+| name      |
++-----------+
+| Quick fox |
++-----------+
+SELECT 1 row in set (0.004 sec)
+
++------------+
+| name       |
++------------+
+| Franz jagt |
++------------+
+SELECT 1 row in set (0.003 sec)
+```
+:::
+
+::::
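+
+Newly inserted rows become visible to searches only after the table has been
+refreshed, which CrateDB does periodically in the background. When running
+the statements above back-to-back as a script, an explicit refresh between
+the `INSERT` and the `SELECT` statements, as sketched here, avoids querying
+before the rows are searchable.
+
+```sql
+-- Make the rows written by the INSERT above visible to subsequent queries.
+REFRESH TABLE documents;
+```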
+
+
+## Usage
+
+Using full-text search in CrateDB.
+
+:::{rubric} `MATCH` predicate
+:::
+CrateDB's [MATCH predicate] performs a full-text search on one or more indexed
+columns or indices and supports different matching techniques.
+
+To run full-text searches on a column, a [fulltext index with an
+analyzer] must be created for this column.
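+
+As an illustration of these options, the following sketch runs against the
+`documents` table from the synopsis. It selects the `_score` system column to
+rank results by relevance, and it sets a match type and match parameters
+explicitly. The query terms are made up for this example, and which rows
+qualify depends on the analyzer configured for the index.
+
+```sql
+-- Rank matches by relevance, best match first.
+SELECT name, _score
+FROM documents
+WHERE MATCH(ft_english, 'lazy fox')
+ORDER BY _score DESC;
+
+-- Require all query terms to match, instead of any of them.
+SELECT name, _score
+FROM documents
+WHERE MATCH(ft_english, 'lazy fox')
+  USING best_fields WITH (operator = 'and')
+ORDER BY _score DESC;
+
+-- Tolerate an edit distance of one between query term and indexed term.
+SELECT name, _score
+FROM documents
+WHERE MATCH(ft_english, 'jumb')
+  USING best_fields WITH (fuzziness = 1)
+ORDER BY _score DESC;
+```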
+
+:::{rubric} Query Language
+:::
+:::{todo}
+Illustrate capabilities of the Lucene query language.
+:::
+
+:::{rubric} Analyzer
+:::
+An analyzer consists of a tokenizer and optional filters. Each analyzer
+uses exactly one tokenizer.
+
+The tokenizer decides how to split the given text into tokens. Filters
+apply a series of transformations, and come in two kinds: character
+filters process the raw text before the tokenization step, while token
+filters process the tokens it produces.
+
+Popular filters include stopword removal, lowercase transformations, and
+word stemmers.
+The excellent article [Improve Your Text Search with Lucene Analyzers]
+covers this topic in more detail, using Elasticsearch as an example.
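+
+A custom analyzer combining these building blocks could be defined as
+sketched below. The analyzer name `html_english` and the `articles` table
+are hypothetical and chosen for this example only. The tokenizer and filter
+names refer to built-ins described in the reference manual.
+
+```sql
+-- Character filters run first, then the tokenizer, then the token filters.
+CREATE ANALYZER html_english (
+  TOKENIZER standard,
+  TOKEN_FILTERS (
+    lowercase,
+    kstem
+  ),
+  CHAR_FILTERS (
+    html_strip
+  )
+);
+
+-- Use the custom analyzer for a full-text index on a column.
+CREATE TABLE articles (
+  id INTEGER PRIMARY KEY,
+  body TEXT INDEX USING FULLTEXT WITH (analyzer = 'html_english')
+);
+```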
+
+
+
+## Learn
+
 Learn how to set up your database for full-text search, how to create the
 relevant indices, and how to query your text data efficiently. A must-read
 for anyone looking to make sense of large volumes of unstructured text data.
 
+:::{rubric} Tutorials
+:::
+
 - [](inv:cloud#full-text)
+- [Custom analyzer combining multiple token filters]
 
 
 :::{note}
@@ -17,3 +218,16 @@ data sets. One of its standout features are its full-text search capabilities,
 built on top of the powerful Lucene library. This makes it a great fit for
 organizing, searching, and analyzing extensive datasets.
 :::
+
+
+[BM25]: https://en.wikipedia.org/wiki/Okapi_BM25
+[BM25: The Next Generation of Lucene Relevance]: https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
+[BM25 vs. Lucene Default Similarity]: https://www.elastic.co/blog/found-bm-vs-lucene-default-similarity
+[Custom analyzer combining multiple token filters]: https://community.cratedb.com/t/fuzzy-search-synonyms/889
+[full-text search]: https://en.wikipedia.org/wiki/Full_text_search
+[Improve Your Text Search with Lucene Analyzers]: https://medium.com/@dagliberkay/elastic-text-search-6b778de9b753
+[MATCH predicate]: inv:crate-reference#predicates_match
+[Okapi BM25]: https://trec.nist.gov/pubs/trec3/papers/city.ps.gz
+[search engine]: https://en.wikipedia.org/wiki/Search_engine
+[Similarity in Elasticsearch]: https://www.elastic.co/blog/found-similarity-in-elasticsearch
+[TREC-3 proceedings]: https://trec.nist.gov/pubs/trec3/t3_proceedings.html