forked from elastic/elasticsearch-definitive-guide

Commit ab0cdc9 (parent ae0fbfb): 9 changed files with 749 additions and 41 deletions.
include::240_Stopwords/10_Intro.asciidoc[]

include::240_Stopwords/20_Using_stopwords.asciidoc[]

include::240_Stopwords/30_Stopwords_and_performance.asciidoc[]

include::240_Stopwords/40_Divide_and_conquer.asciidoc[]

include::240_Stopwords/50_Phrase_queries.asciidoc[]

include::240_Stopwords/60_Common_grams.asciidoc[]

include::240_Stopwords/70_Relevance.asciidoc[]
[[stopwords-performance]]
=== Stopwords and performance

The biggest disadvantage of keeping stopwords is that of performance. When
Elasticsearch performs a full text search, it has to calculate the relevance
`_score` on all matching documents in order to return the top 10 matches.

While most words typically occur in much fewer than 0.1% of all documents, a
few words like `the` may occur in almost all of them. Imagine you have an
index of 1 million documents. A query for `quick brown fox` may match fewer
than 1,000 documents. But a query for `the quick brown fox` has to score and
sort almost all of the 1 million documents in your index, just in order to
return the top 10!

The problem is that `the quick brown fox` is really a query for `the OR quick
OR brown OR fox` -- any document that contains nothing more than the almost
meaningless term `the` is included in the result set. What we need is a way of
reducing the number of documents that need to be scored.
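Under the hood, the default `or` operator turns such a query into a `bool` query made up of optional `should` clauses, along the lines of the following sketch (the rewrite happens internally; this is not literal output):

[source,json]
---------------------------------
{
    "bool": {
        "should": [
            { "term": { "text": "the" }},
            { "term": { "text": "quick" }},
            { "term": { "text": "brown" }},
            { "term": { "text": "fox" }}
        ]
    }
}
---------------------------------

A document matching even a single `should` clause, such as the near-ubiquitous `the`, counts as a hit and must be scored.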
[[stopwords-and]]
==== `and` operator

The easiest way to reduce the number of documents is simply to use the
<<match-improving-precision,`and` operator>> with the `match` query, in order
to make all words required.

A `match` query like:

[source,json]
---------------------------------
{
    "match": {
        "text": {
            "query":    "the quick brown fox",
            "operator": "and"
        }
    }
}
---------------------------------

is rewritten as a `bool` query like:

[source,json]
---------------------------------
{
    "bool": {
        "must": [
            { "term": { "text": "the" }},
            { "term": { "text": "quick" }},
            { "term": { "text": "brown" }},
            { "term": { "text": "fox" }}
        ]
    }
}
---------------------------------

The `bool` query is intelligent enough to execute each `term` query in the
optimal order -- it starts with the least frequent term. Because all terms
are required, only documents that contain the least frequent term can possibly
match. Using the `and` operator greatly speeds up multi-term queries.
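To try this out, the query can be wrapped in a full search request. Here `my_index` and the `text` field are illustrative names, not fixtures from this chapter:

[source,json]
---------------------------------
GET /my_index/_search
{
    "query": {
        "match": {
            "text": {
                "query":    "the quick brown fox",
                "operator": "and"
            }
        }
    }
}
---------------------------------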
==== `minimum_should_match`

In <<match-precision>> we discussed using the `minimum_should_match` parameter
to trim the long tail of less-relevant results. It is useful for this purpose
alone but, as a nice side effect, it offers a similar performance benefit to
the `and` operator:

[source,json]
---------------------------------
{
    "match": {
        "text": {
            "query": "the quick brown fox",
            "minimum_should_match": "75%"
        }
    }
}
---------------------------------
In this example, at least three out of the four terms must match. This means
that the only docs that need to be considered are those that contain either
the least or second least frequent terms.
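Expressed as a `bool` query, the same constraint looks something like the following sketch (`"75%"` of four optional clauses rounds down to three required matches; the actual rewrite happens internally):

[source,json]
---------------------------------
{
    "bool": {
        "should": [
            { "term": { "text": "the" }},
            { "term": { "text": "quick" }},
            { "term": { "text": "brown" }},
            { "term": { "text": "fox" }}
        ],
        "minimum_should_match": 3
    }
}
---------------------------------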
This offers a huge performance gain over a simple query with the default `or`
operator! But we can do better yet...