Added Stopwords chapter
clintongormley committed Jun 26, 2014
1 parent ae0fbfb commit ab0cdc9
Showing 9 changed files with 749 additions and 41 deletions.
1 change: 1 addition & 0 deletions 100_Full_Text_Search/10_Multi_word_queries.asciidoc
@@ -73,6 +73,7 @@ The important thing to take away from the above is that any document whose
`title` field contains *at least one of the specified terms* will match the
query. The more terms that match, the more relevant the document.

[[match-improving-precision]]
==== Improving precision

Matching any document which contains *any* of the query terms may result in a
12 changes: 6 additions & 6 deletions 240_Stopwords.asciidoc
@@ -1,14 +1,14 @@

include::240_Stopwords/10_Intro.asciidoc[]

include::240_Stopwords/20_Using_stopwords.asciidoc[]

include::240_Stopwords/30_Stopwords_and_performance.asciidoc[]

include::240_Stopwords/40_Divide_and_conquer.asciidoc[]

include::240_Stopwords/50_Phrase_queries.asciidoc[]

include::240_Stopwords/60_Common_grams.asciidoc[]

include::240_Stopwords/70_Relevance.asciidoc[]

10 changes: 6 additions & 4 deletions 240_Stopwords/10_Intro.asciidoc
@@ -51,16 +51,18 @@ stopwords used in Elasticsearch are:
These _stopwords_ can usually be filtered out before indexing with little
negative impact on retrieval. But is it a good idea to do so?

[[pros-cons-stopwords]]
[float]
=== Pros and cons of stopwords

We have more disk space, more RAM, and better compression algorithms than
existed back in the day. Excluding the above 33 common words from the index
will only save about 4MB per million documents. Using stopwords for the sake
of reducing index size is no longer a valid reason. (Although, there is one
caveat to this statement which we will discuss in <<stopwords-phrases>>.)

On top of that, by removing words from the index we are reducing our ability
to perform certain types of search. Filtering out the words listed above
prevents us from:

* distinguishing ``happy'' from ``not happy''.
@@ -78,8 +80,8 @@ the `_score` for all 1 million documents. This second query simply cannot
perform as well as the first.

Fortunately, there are techniques which we can use to keep common words
searchable, while still maintaining good performance. First, we'll start with
how to use stopwords.



66 changes: 35 additions & 31 deletions 240_Stopwords/20_Using_stopwords.asciidoc
@@ -1,13 +1,10 @@
:ref: http://foo.com/

[[using-stopwords]]
=== Using stopwords

The removal of stopwords is handled by the
{ref}analysis-stop-tokenfilter.html[`stop` token filter] which can be used
when creating a `custom` analyzer (see <<stop-token-filter>> below).
However, some out-of-the-box analyzers come with the `stop` filter pre-integrated:

{ref}analysis-lang-analyzer.html[Language analyzers]::

@@ -28,7 +25,7 @@ already:

To use custom stopwords in conjunction with the `standard` analyzer, all we
need to do is to create a configured version of the analyzer and pass in the
list of `stopwords` that we require:

[source,json]
---------------------------------
@@ -39,19 +36,21 @@ PUT /my_index
"analyzer": {
"my_analyzer": { <1>
"type": "standard", <2>
"stopwords": [ <3>
"and",<3>
"the"
]
}}}}}
"stopwords": [ "and", "the" ] <3>
}
}
}
}
}
---------------------------------
<1> This is a custom analyzer called `my_analyzer`.
<2> This analyzer is the `standard` analyzer with some custom configuration.
<3> The stopwords to filter out are `and` and `the`.

TIP: This same technique can be used to configure custom stopword lists for
any of the language analyzers.
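
For instance, a minimal sketch based on the `english` analyzer might look like
this (the `my_english` name and the stopword list are purely illustrative):

[source,json]
---------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type":      "english", <1>
          "stopwords": [ "and", "the" ] <2>
        }
      }
    }
  }
}
---------------------------------
<1> Based on the built-in `english` language analyzer.
<2> An illustrative custom list, which replaces the analyzer's default English
    stopwords.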

[[maintaining-positions]]
==== Maintaining positions

The output from the `analyze` API is quite interesting:
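
Output like that can be produced with a request along these lines (the sample
sentence is illustrative; any text containing `quick` and `dead` will do):

[source,json]
---------------------------------
GET /my_index/_analyze?analyzer=my_analyzer <1>
The quick and the dead <2>
---------------------------------
<1> Uses the `my_analyzer` analyzer defined above.
<2> The text to analyze is passed as the request body.

With `and` and `the` filtered out, only `quick` and `dead` remain, but each
keeps the position it occupied in the original text.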
@@ -92,6 +91,7 @@ important for phrase queries -- if the positions of each term had been
adjusted, then a phrase query for `"quick dead"` would have matched the above
example incorrectly.

[[specifying-stopwords]]
==== Specifying stopwords

Stopwords can be passed inline, as we did in the previous example, by
@@ -150,23 +150,27 @@ PUT /my_index
"analyzer": {
"my_english": {
"type": "english",
"stopwords_path": "config/stopwords/english.txt" <1>
"stopwords_path": "stopwords/english.txt" <1>
}
}
}
}
}
---------------------------------
<1> The path to the stopwords file, relative to the Elasticsearch `config`
directory.
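
Stopword filtering can also be disabled entirely, or a predefined language
list can be selected by name. For instance, a sketch (with an illustrative
analyzer name):

[source,json]
---------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type":      "english",
          "stopwords": "_none_" <1>
        }
      }
    }
  }
}
---------------------------------
<1> `_none_` disables stopword removal altogether, while a value such as
    `_english_` selects the predefined list of English stopwords.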

[[stop-token-filter]]
==== Using the `stop` token filter

The {ref}analysis-stop-tokenfilter.html[`stop` token filter] can be combined
with a tokenizer and other token filters when you need to create a `custom`
analyzer. For instance, let's say that we wanted to create a Spanish analyzer
with:

* a custom stopwords list.
* the `light_spanish` stemmer.
* the <<asciifolding-token-filter,`asciifolding` filter>> to remove diacritics.

We could set that up as follows:
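
The following is a minimal sketch of such a configuration; the `my_spanish`
analyzer name and the particular stopwords shown are illustrative, and the
callouts are explained below:

[source,json]
---------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type":      "stop",
          "stopwords": [ "si", "esta", "el", "la" ] <1>
        },
        "light_spanish": { <2>
          "type":     "stemmer",
          "language": "light_spanish"
        }
      },
      "analyzer": {
        "my_spanish": {
          "tokenizer": "standard",
          "filter": [ <3>
            "lowercase",
            "asciifolding",
            "spanish_stop",
            "light_spanish"
          ]
        }
      }
    }
  }
}
---------------------------------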

@@ -203,22 +207,22 @@ PUT /my_index
---------------------------------
<1> The `stop` token filter takes the same `stopwords` and `stopwords_path`
parameters as the `standard` analyzer.
<2> See <<algorithmic-stemmers>>.
<3> The order of token filters is important, as explained below.

We have placed the `spanish_stop` filter after the `asciifolding` filter. This
means that `esta`, `ésta` and ++está++ will first have their diacritics
removed to become just `esta`, which will then be removed as a stopword. If,
instead, we wanted to remove `esta` and `ésta`, but not ++está++, then we
would have to put the `spanish_stop` filter *before* the `asciifolding`
filter, and specify both words in the stopwords list.

[[updating-stopwords]]
==== Updating stopwords

There are a few techniques which can be used to update the list of stopwords
used by an analyzer. Analyzers are instantiated at index creation time, when a
node is restarted, or when a closed index is reopened.

If you specify stopwords inline with the `stopwords` parameter, then your
only option is to close the index, update the analyzer configuration with the
@@ -227,13 +231,13 @@ the index.
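
For instance, assuming the inline-configured `my_analyzer` from earlier (the
new stopword list shown here is illustrative), the sequence might look like
this:

[source,json]
---------------------------------
POST /my_index/_close <1>

PUT /my_index/_settings <2>
{
  "analysis": {
    "analyzer": {
      "my_analyzer": {
        "type":      "standard",
        "stopwords": [ "and", "the", "or" ]
      }
    }
  }
}

POST /my_index/_open <3>
---------------------------------
<1> Analyzers can only be changed while the index is closed.
<2> Update the analyzer configuration with the new stopword list.
<3> Reopening the index recreates the analyzer.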

Updating stopwords is easier if you specify them in a file with the
`stopwords_path` parameter. You can just update the file (on every node in
the cluster) then force the analyzers to be recreated by either:

* closing and reopening the index
(see {ref}indices-open-close.html[open/close index]), or
* restarting each node in the cluster, one by one.

Of course, updating the stopwords list will not change any documents that have
already been indexed -- it will only apply to searches and to new or updated
documents. To apply the changes to existing documents you will need to
reindex your data. See <<reindex>>.
86 changes: 86 additions & 0 deletions 240_Stopwords/30_Stopwords_and_performance.asciidoc
@@ -0,0 +1,86 @@
[[stopwords-performance]]
=== Stopwords and performance

The biggest disadvantage of keeping stopwords is their impact on performance. When
Elasticsearch performs a full text search, it has to calculate the relevance
`_score` on all matching documents in order to return the top 10 matches.

While most words typically occur in much fewer than 0.1% of all documents, a
few words like `the` may occur in almost all of them. Imagine you have an
index of 1 million documents. A query for `quick brown fox` may match fewer
than 1,000 documents. But a query for `the quick brown fox` has to score and
sort almost all of the 1 million documents in your index, just in order to
return the top 10!

The problem is that `the quick brown fox` is really a query for `the OR quick
OR brown OR fox` -- any document which contains nothing more than the almost
meaningless term `the` is included in the result set. What we need is a way of
reducing the number of documents that need to be scored.
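
To make this concrete, a `match` query for `the quick brown fox` with the
default `or` operator is rewritten internally into a `bool` query along these
lines, where any single clause is enough to match:

[source,json]
---------------------------------
{
  "bool": {
    "should": [
      { "term": { "text": "the"   }},
      { "term": { "text": "quick" }},
      { "term": { "text": "brown" }},
      { "term": { "text": "fox"   }}
    ]
  }
}
---------------------------------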

[[stopwords-and]]
==== `and` operator

The easiest way to reduce the number of documents is simply to use the
<<match-improving-precision,`and` operator>> with the `match` query, in order
to make all words required.

A `match` query like:

[source,json]
---------------------------------
{
"match": {
"text": {
"query": "the quick brown fox",
"operator": "and"
}
}
}
---------------------------------

is rewritten as a `bool` query like:

[source,json]
---------------------------------
{
"bool": {
"must": [
{ "term": { "text": "the" }},
{ "term": { "text": "quick" }},
{ "term": { "text": "brown" }},
{ "term": { "text": "fox" }}
]
}
}
---------------------------------

The `bool` query is intelligent enough to execute each `term` query in the
optimal order -- it starts with the least frequent term. Because all terms
are required, only documents that contain the least frequent term can possibly
match. Using the `and` operator greatly speeds up multi-term queries.
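
As a usage sketch, the full search request would look something like the
following (the `my_index` index and `text` field follow the earlier examples):

[source,json]
---------------------------------
GET /my_index/_search
{
  "query": {
    "match": {
      "text": {
        "query":    "the quick brown fox",
        "operator": "and"
      }
    }
  }
}
---------------------------------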

==== `minimum_should_match`

In <<match-precision>> we discussed using the `minimum_should_match` parameter
to trim the long tail of less relevant results. It is useful for this purpose
alone but, as a nice side effect, it offers a similar performance benefit to
the `and` operator:

[source,json]
---------------------------------
{
"match": {
"text": {
"query": "the quick brown fox",
"minimum_should_match": "75%"
}
}
}
---------------------------------

In this example, at least three out of the four terms must match. This means
that the only documents that need to be considered are those that contain
either the least frequent or the second least frequent term.

This offers a huge performance gain over a simple query with the default `or`
operator! But we can do better yet...

