From ab0cdc96dfe40d2c0926c24b3d41b7080cf65d97 Mon Sep 17 00:00:00 2001 From: Clinton Gormley Date: Thu, 26 Jun 2014 14:52:41 +0200 Subject: [PATCH] Added Stopwords chapter --- .../10_Multi_word_queries.asciidoc | 1 + 240_Stopwords.asciidoc | 12 +- 240_Stopwords/10_Intro.asciidoc | 10 +- 240_Stopwords/20_Using_stopwords.asciidoc | 66 +++--- .../30_Stopwords_and_performance.asciidoc | 86 ++++++++ 240_Stopwords/40_Divide_and_conquer.asciidoc | 193 +++++++++++++++++ 240_Stopwords/50_Phrase_queries.asciidoc | 147 +++++++++++++ 240_Stopwords/60_Common_grams.asciidoc | 203 ++++++++++++++++++ 240_Stopwords/70_Relevance.asciidoc | 72 +++++++ 9 files changed, 749 insertions(+), 41 deletions(-) create mode 100644 240_Stopwords/30_Stopwords_and_performance.asciidoc create mode 100644 240_Stopwords/40_Divide_and_conquer.asciidoc create mode 100644 240_Stopwords/50_Phrase_queries.asciidoc create mode 100644 240_Stopwords/60_Common_grams.asciidoc create mode 100644 240_Stopwords/70_Relevance.asciidoc diff --git a/100_Full_Text_Search/10_Multi_word_queries.asciidoc b/100_Full_Text_Search/10_Multi_word_queries.asciidoc index ab22bc3b0..e59c06426 100644 --- a/100_Full_Text_Search/10_Multi_word_queries.asciidoc +++ b/100_Full_Text_Search/10_Multi_word_queries.asciidoc @@ -73,6 +73,7 @@ The important thing to take away from the above is that any document whose `title` field contains *at least one of the specified terms* will match the query. The more terms that match, the more relevant the document. +[[match-improving-precision]] ==== Improving precision Matching any document which contains *any* of the query terms may result in a diff --git a/240_Stopwords.asciidoc b/240_Stopwords.asciidoc index ef0be723e..cd4dd24af 100644 --- a/240_Stopwords.asciidoc +++ b/240_Stopwords.asciidoc @@ -1,14 +1,14 @@ - include::240_Stopwords/10_Intro.asciidoc[] include::240_Stopwords/20_Using_stopwords.asciidoc[] +include::240_Stopwords/30_Stopwords_and_performance.asciidoc[] + +include::240_Stopwords/40_Divide_and_conquer.asciidoc[] -common terms query -match query +include::240_Stopwords/50_Phrase_queries.asciidoc[] -relevance +include::240_Stopwords/60_Common_grams.asciidoc[] -bm25 +include::240_Stopwords/70_Relevance.asciidoc[] -common grams token filter diff --git a/240_Stopwords/10_Intro.asciidoc b/240_Stopwords/10_Intro.asciidoc index 3560ab47c..5d7771822 100644 --- a/240_Stopwords/10_Intro.asciidoc +++ b/240_Stopwords/10_Intro.asciidoc @@ -51,16 +51,18 @@ stopwords used in Elasticsearch are: These _stopwords_ can usually be filtered out before indexing with little negative impact on retrieval. But is it a good idea to do so? +[[pros-cons-stopwords]] [float] === Pros and cons of stopwords We have more disk space, more RAM, and better compression algorithms than existed back in the day. Excluding the above 33 common words from the index will only save about 4MB per million documents. Using stopwords for the sake -of reducing index size is no longer a valid reason. +of reducing index size is no longer a valid reason. (Although, there is one +caveat to this statement which we will discuss in <>.) On top of that, by removing words from the index we are reducing our ability -to perform certain types of search. Filtering out the above stopwords +to perform certain types of search. Filtering out the words listed above prevents us from: * distinguishing ``happy'' from ``not happy''. @@ -78,8 +80,8 @@ the `_score` for all 1 million documents. This second query simply cannot perform as well as the first. 
Fortunately, there are techniques which we can use to keep common words -searchable, while benefiting from the performance gain of stopwords. First, -let's start with how to use stopwords. +searchable, while still maintaining good performance. First, we'll start with +how to use stopwords. diff --git a/240_Stopwords/20_Using_stopwords.asciidoc b/240_Stopwords/20_Using_stopwords.asciidoc index b6e70b4e0..6e51b4d63 100644 --- a/240_Stopwords/20_Using_stopwords.asciidoc +++ b/240_Stopwords/20_Using_stopwords.asciidoc @@ -1,13 +1,10 @@ -:ref: http://foo.com/ - [[using-stopwords]] === Using stopwords The removal of stopwords is handled by the {ref}analysis-stop-tokenfilter.html[`stop` token filter] which can be used -when creating a `custom` analyzer, as described below in <>. -However, some out-of-the-box analyzers have the `stop` filter integrated -already: +when creating a `custom` analyzer (see <> below). +However, some out-of-the-box analyzers come with the `stop` filter pre-integrated: {ref}analysis-lang-analyzer.html[Language analyzers]:: @@ -28,7 +25,7 @@ already: To use custom stopwords in conjunction with the `standard` analyzer, all we need to do is to create a configured version of the analyzer and pass in the -list of `stopwords that we require: +list of `stopwords` that we require: [source,json] --------------------------------- @@ -39,19 +36,21 @@ PUT /my_index "analyzer": { "my_analyzer": { <1> "type": "standard", <2> - "stopwords": [ <3> - "and",<3> - "the" - ] -}}}}} + "stopwords": [ "and", "the" ] <3> + } + } + } + } +} --------------------------------- <1> This is a custom analyzer called `my_analyzer`. <2> This analyzer is the `standard` analyzer with some custom configuration. <3> The stopwords to filter out are `and` and `the`. -TIP: The same technique can be used to configure custom stopword lists for +TIP: This same technique can be used to configure custom stopword lists for any of the language analyzers. +[[maintaining-positions]] ==== Maintaining positions The output from the `analyze` API is quite interesting: @@ -92,6 +91,7 @@ important for phrase queries -- if the positions of each term had been adjusted, then a phrase query for `"quick dead"` would have matched the above example incorrectly. +[[specifying-stopwords]] ==== Specifying stopwords Stopwords can be passed inline, as we did in the previous example, by @@ -150,23 +150,27 @@ PUT /my_index "analyzer": { "my_english": { "type": "english", - "stopwords_path": "config/stopwords/english.txt" <1> + "stopwords_path": "stopwords/english.txt" <1> } } } } } --------------------------------- -<1> The path to the stopwords file, relative to the Elasticsearch directory. +<1> The path to the stopwords file, relative to the Elasticsearch `config` + directory. [[stop-token-filter]] ==== Using the `stop` token filter -The {ref}analysis-stop-tokenfilter.html[`stop` token filter] can be used -directly when you need to create a `custom` analyzer. For instance, let's say -that we wanted to create a Spanish analyzer with a custom stopwords list -and the `light_spanish` stemmer, which also -<>. +The {ref}analysis-stop-tokenfilter.html[`stop` token filter] can be combined +with a tokenizer and other token filters when you need to create a `custom` +analyzer. For instance, let's say that we wanted to create a Spanish analyzer +with: + +* a custom stopwords list. +* the `light_spanish` stemmer. +* the <> to remove diacritics. 
We could set that up as follows: @@ -203,22 +207,22 @@ PUT /my_index --------------------------------- <1> The `stop` token filter takes the same `stopwords` and `stopwords_path` parameters as the `standard` analyzer. -<2> See <>. -<3> The order of token filters is important, see below. +<2> See <>. +<3> The order of token filters is important, as explained below. -The `spanish_stop` filter comes after the `asciifolding` filter. This means -that `esta`, `èsta` and ++està++ will first have their diacritics removed to -become just `esta`, which is removed as a stopword. If, instead, we wanted to -remove `esta` and `èsta`, but not ++està++, then we would have to put the -`spanish_stop` filter *before* the `asciifolding` filter, and specify both -words in the stopwords list. +We have placed the `spanish_stop` filter after the `asciifolding` filter. This +means that `esta`, `ésta` and ++está++ will first have their diacritics +removed to become just `esta`, which will then be removed as a stopword. If, +instead, we wanted to remove `esta` and `ésta`, but not ++está++, then we +would have to put the `spanish_stop` filter *before* the `asciifolding` +filter, and specify both words in the stopwords list. [[updating-stopwords]] ==== Updating stopwords There are a few techniques which can be used to update the list of stopwords -in use. Analyzers are instantiated at index creation time, when a node is -restarted, or when a closed index is reopened. +used by an analyzer. Analyzers are instantiated at index creation time, when a +node is restarted, or when a closed index is reopened. If you specify stopwords inline with the `stopwords` parameter, then your only option is to close the index, update the analyzer configuration with the @@ -227,13 +231,13 @@ the index. Updating stopwords is easier if you specify them in a file with the `stopwords_path` parameter. You can just update the file (on every node in -the cluster) then force the analyzers to be recreated by: +the cluster) then force the analyzers to be recreated by either: * closing and reopening the index (see {ref}indices-open-close.html[open/close index]), or * restarting each node in the cluster, one by one. Of course, updating the stopwords list will not change any documents that have -already been indexed. It will only apply to searches and to new or updated +already been indexed -- it will only apply to searches and to new or updated documents. To apply the changes to existing documents you will need to reindex your data. See <> diff --git a/240_Stopwords/30_Stopwords_and_performance.asciidoc b/240_Stopwords/30_Stopwords_and_performance.asciidoc new file mode 100644 index 000000000..d3367a9a9 --- /dev/null +++ b/240_Stopwords/30_Stopwords_and_performance.asciidoc @@ -0,0 +1,86 @@ +[[stopwords-performance]] +=== Stopwords and performance + +The biggest disadvantage of keeping stopwords is that of performance. When +Elasticsearch performs a full text search, it has to calculate the relevance +`_score` on all matching documents in order to return the top 10 matches. + +While most words typically occur in much fewer than 0.1% of all documents, a +few words like `the` may occur in almost all of them. Imagine you have an +index of 1 million documents. A query for `quick brown fox` may match fewer +than 1,000 documents. But a query for `the quick brown fox` has to score and +sort almost all of the 1 million documents in your index, just in order to +return the top 10! 
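+
+To make the problem concrete, this is the kind of query we are talking about
+-- a plain `match` query with the default `or` operator, shown here as a
+sketch against a hypothetical `text` field (the same field name used in the
+examples that follow):
+
+[source,json]
+---------------------------------
+{
+    "match": {
+        "text": "the quick brown fox"
+    }
+}
+---------------------------------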
+
+The problem is that `the quick brown fox` is really a query for `the OR quick
+OR brown OR fox` -- any document which contains nothing more than the almost
+meaningless term `the` is included in the result set. What we need is a way of
+reducing the number of documents that need to be scored.
+
+[[stopwords-and]]
+==== `and` operator
+
+The easiest way to reduce the number of documents is simply to use the
+<> with the `match` query, in order
+to make all words required.
+
+A `match` query like:
+
+[source,json]
+---------------------------------
+{
+    "match": {
+        "text": {
+            "query": "the quick brown fox",
+            "operator": "and"
+        }
+    }
+}
+---------------------------------
+
+is rewritten as a `bool` query like:
+
+[source,json]
+---------------------------------
+{
+    "bool": {
+        "must": [
+            { "term": { "text": "the" }},
+            { "term": { "text": "quick" }},
+            { "term": { "text": "brown" }},
+            { "term": { "text": "fox" }}
+        ]
+    }
+}
+---------------------------------
+
+The `bool` query is intelligent enough to execute each `term` query in the
+optimal order -- it starts with the least frequent term. Because all terms
+are required, only documents that contain the least frequent term can possibly
+match. Using the `and` operator greatly speeds up multi-term queries.
+
+==== `minimum_should_match`
+
+In <> we discussed using the `minimum_should_match` parameter
+to trim the long tail of less relevant results. It is useful for this purpose
+alone but, as a nice side effect, it offers a similar performance benefit to
+the `and` operator:
+
+[source,json]
+---------------------------------
+{
+    "match": {
+        "text": {
+            "query": "the quick brown fox",
+            "minimum_should_match": "75%"
+        }
+    }
+}
+---------------------------------
+
+In this example, at least three out of the four terms must match. This means
+that the only docs that need to be considered are those that contain either
+the least or the second least frequent term.
+
+This offers a huge performance gain over a simple query with the default `or`
+operator! But we can do better yet...
+
diff --git a/240_Stopwords/40_Divide_and_conquer.asciidoc b/240_Stopwords/40_Divide_and_conquer.asciidoc
new file mode 100644
index 000000000..604a29ebe
--- /dev/null
+++ b/240_Stopwords/40_Divide_and_conquer.asciidoc
@@ -0,0 +1,193 @@
+[[common-terms]]
+=== Divide and conquer
+
+The terms in a query string can be divided into more important (low frequency)
+and less important (high frequency) terms. Documents that match only the less
+important terms are probably of very little interest. Really, we want
+documents that match as many of the more important terms as possible.
+
+The `match` query accepts a `cutoff_frequency` parameter, which allows it to
+divide the terms in the query string into a low frequency and a high frequency
+group. The low frequency group (more important terms) forms the bulk of the
+query, while the high frequency group (less important terms) is used only for
+scoring, not for matching. By treating these two groups differently, we can
+gain a real boost in speed on previously slow queries.
+
+.Domain specific stopwords
+*********************************************
+
+One of the benefits of `cutoff_frequency` is that you get _domain specific_
+stopwords for free. For instance, a website about movies may use the words
+``movie'', ``color'', ``black'' and ``white'' so often that they could be
+considered almost meaningless. With the `stop` token filter, these domain
+specific terms would have to be added to the stopwords list manually.
+However, because the `cutoff_frequency` looks at the actual frequency of terms
+in the index, these words would be classified as _high frequency_
+automatically.
+
+*********************************************
+
+Take this query as an example:
+
+[source,json]
+---------------------------------
+{
+    "match": {
+        "text": {
+            "query": "Quick and the dead",
+            "cutoff_frequency": 0.01 <1>
+        }
+    }
+}
+---------------------------------
+<1> Any term that occurs in more than 1% of documents is considered to be high
+    frequency. The `cutoff_frequency` can be specified as a fraction (`0.01`)
+    or as an absolute number (`5`).
+
+This query uses the `cutoff_frequency` to first divide the query terms into a
+low frequency group: (`quick`, `dead`), and a high frequency group: (`and`,
+`the`). Then, the query is rewritten to produce the following `bool` query:
+
+[source,json]
+---------------------------------
+{
+    "bool": {
+        "must": { <1>
+            "bool": {
+                "should": [
+                    { "term": { "text": "quick" }},
+                    { "term": { "text": "dead" }}
+                ]
+            }
+        },
+        "should": { <2>
+            "bool": {
+                "should": [
+                    { "term": { "text": "and" }},
+                    { "term": { "text": "the" }}
+                ]
+            }
+        }
+    }
+}
+---------------------------------
+<1> At least one low frequency / high importance term *must* match.
+<2> High frequency / low importance terms are entirely optional.
+
+The `must` clause means that at least one of the low frequency terms --
+`quick` or `dead` -- *must* be present for a document to be considered a
+match. All other documents are excluded. The `should` clause then looks for
+the high frequency terms `and` and `the`, but only in the documents collected
+by the `must` clause. The sole job of the `should` clause is to score a
+document like ``Quick **AND THE** dead'' higher than ``**THE** quick but
+dead''. This approach greatly reduces the number of documents that need to be
+examined and scored.
+
+.`and` query
+********************************
+
+Setting the `operator` parameter to `and` would simply make all low and high
+frequency terms required. As we saw in <>, this is already an
+efficient query.
+
+********************************
+
+==== Controlling precision
+
+The `minimum_should_match` parameter can be combined with `cutoff_frequency`,
+but it only applies to the low frequency terms. This query:
+
+[source,json]
+---------------------------------
+{
+    "match": {
+        "text": {
+            "query": "Quick and the dead",
+            "cutoff_frequency": 0.01,
+            "minimum_should_match": "75%"
+        }
+    }
+}
+---------------------------------
+
+would be rewritten as:
+
+[source,json]
+---------------------------------
+{
+    "bool": {
+        "must": {
+            "bool": {
+                "should": [
+                    { "term": { "text": "quick" }},
+                    { "term": { "text": "dead" }}
+                ],
+                "minimum_should_match": 1 <1>
+            }
+        },
+        "should": { <2>
+            "bool": {
+                "should": [
+                    { "term": { "text": "and" }},
+                    { "term": { "text": "the" }}
+                ]
+            }
+        }
+    }
+}
+---------------------------------
+<1> Because there are only two terms, the original 75% is rounded down
+    to `1`, that is: ``1 out of 2 low frequency terms must match''.
+<2> The high frequency terms are still optional and used only for scoring.
+
+==== Only high frequency terms
+
+An `or` query for high frequency terms only -- ``To be or not to be'' -- is
+the worst case for performance. It is pointless to score *all* of the
+documents that contain only one of these terms in order to return just the top
+ten matches.
We are really only interested in documents where they all occur +together, so in the case where there are no low frequency terms, the query is +rewritten to make all high frequency terms required: + +[source,json] +--------------------------------- +{ + "bool": { + "must": [ + { "term": { "text": "to" }}, + { "term": { "text": "be" }}, + { "term": { "text": "or" }}, + { "term": { "text": "not" }}, + { "term": { "text": "to" }}, + { "term": { "text": "be" }} + ] + } +} +--------------------------------- + +==== More control with `common` terms + +While the high/low frequency functionality in the `match` query is useful, +sometimes you want more control over how the high and low frequency groups +should be handled. The `match` query just exposes a subset of the +functionality available in the `common` terms query. + +For instance, we could make all low frequency terms, and 75% of high +frequency terms required with a query like this: + +[source,json] +--------------------------------- +{ + "common": { + "text": { + "query": "Quick and the dead", + "cutoff_frequency": 0.01, + "low_freq_operator": "and", + "minimum_should_match": { + "high_freq": "75%" + } + } + } +} +--------------------------------- + +See the {ref}query-dsl-common-terms-query.html[`common` terms query] reference +page for more options. + diff --git a/240_Stopwords/50_Phrase_queries.asciidoc b/240_Stopwords/50_Phrase_queries.asciidoc new file mode 100644 index 000000000..a29eceee0 --- /dev/null +++ b/240_Stopwords/50_Phrase_queries.asciidoc @@ -0,0 +1,147 @@ +[[stopwords-phrases]] +=== Stopwords and phrase queries + +About 5% of all queries are phrase queries (see <>), but they +often account for the majority of slow queries. Phrase queries can perform +poorly, especially if the phrase includes very common words -- a phrase like +``To be or not to be'' could be considered pathological. The reason for this +has to do with the amount of data that is necessary to support proximity +matching. + +In <> we said that removing stopwords saves only a small +amount of space in the inverted index. That was only partially true. A +typical index may contain, amongst other data, some or all of: + +Terms dictionary:: + + A sorted list of all terms that appear in the documents in the index, + and a count of how many documents contain each term. + +Postings list:: + + A list of which documents contain each term. + +Term frequency:: + + How often each term appears in each document. + +Positions:: + + The position of each term within each document, for phrase and proximity + queries. + +Offsets:: + + The start and end character offsets of each term in each document, for + snippet highlighting. Disabled by default. + +Norms:: + + A factor used to normalize fields of different lengths, to give shorter + fields more weight. + +Removing stopwords from the index may save a small amount of space in the +_terms dictionary_ and the _postings list_, but _positions_ and _offsets_ are +another matter. Positions and offsets data can easily double, triple, or +quadruple index size. + +==== Positions data + +Positions are enabled on `analyzed` string fields by default, so that phrase +queries will work out of the box. The more often that a term appears, the more +space that is needed to store its position data. Very common words, by +definition, appear very commonly and their positions data can run to megabytes +or gigabytes on large corpuses. 
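+
+For example, a phrase query as simple as the following -- a sketch that reuses
+the hypothetical `text` field from the earlier examples -- has to load the
+position of every occurrence of `the` before it can check which occurrences
+are followed by `quick`:
+
+[source,json]
+---------------------------------
+{
+    "match_phrase": {
+        "text": "the quick brown fox"
+    }
+}
+---------------------------------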
+
+Running a phrase query on a high frequency word like `the` might result in
+gigabytes of data being read from disk. That data will be stored in the kernel
+file system cache to speed up later access, which seems like a good thing, but
+it might cause other data to be evicted from the cache, which will slow down
+subsequent queries.
+
+This is clearly a problem that needs solving.
+
+[[index-options]]
+==== Index options
+
+The first question you should ask yourself is: ``**Do you need phrase or
+proximity queries?**''
+
+Often, the answer is no. For many use cases, such as logging, you need to
+know *whether* a term appears in a document -- information which is provided
+by the postings list -- but not *where* it appears. Or perhaps you need to use
+phrase queries on one or two fields, but you can disable positions data on all
+of the other analyzed `string` fields.
+
+The `index_options` parameter allows you to control what information is stored
+in the index for each field. Valid values are:
+
+`docs`::
+
+    Only store which documents contain which terms. This is the default for
+    `not_analyzed` string fields.
+
+`freqs`::
+
+    Store `docs` information, plus how often each term appears in each
+    document. Term frequencies are needed for complete <>
+    relevance calculations, but they are not required if you just need to know
+    whether a document contains a particular term or not.
+
+`positions`::
+
+    Store `docs` and `freqs`, plus the position of each term in each document.
+    This is the default for `analyzed` string fields, but can be disabled if
+    phrase/proximity matching is not needed.
+
+`offsets`::
+
+    Store `docs`, `freqs`, `positions` and the start and end character offsets
+    of each term in the original string. This information is used by the
+    {ref}postings-highlighter.html[`postings` highlighter] but is disabled
+    by default.
+
+You can set `index_options` on fields added at index creation time, or when
+adding new fields using the `put-mapping` API. This setting can't be changed
+on existing fields:
+
+[source,json]
+---------------------------------
+PUT /my_index
+{
+    "mappings": {
+        "my_type": {
+            "properties": {
+                "title": { <1>
+                    "type": "string"
+                },
+                "content": { <2>
+                    "type": "string",
+                    "index_options": "freqs"
+                }
+            }
+        }
+    }
+}
+---------------------------------
+<1> The `title` field uses the default setting of `positions`, so
+    it is suitable for phrase/proximity queries.
+<2> The `content` field has positions disabled and so cannot be used
+    for phrase/proximity queries.
+
+==== Stopwords
+
+Removing stopwords is one way of reducing the size of the positions data quite
+dramatically. An index with stopwords removed can still be used for phrase
+queries because the original positions of the remaining terms are maintained,
+as we saw in <>. But of course, excluding terms from
+the index reduces searchability. We wouldn't be able to differentiate between
+the two phrases ``Man in the moon'' and ``Man on the moon''.
+
+Fortunately, there is a way to have our cake and eat it: the
+<>.
+
diff --git a/240_Stopwords/60_Common_grams.asciidoc b/240_Stopwords/60_Common_grams.asciidoc
new file mode 100644
index 000000000..d6576319e
--- /dev/null
+++ b/240_Stopwords/60_Common_grams.asciidoc
@@ -0,0 +1,203 @@
+:ref: http://foo.com/
+
+[[common-grams]]
+=== `common_grams` token filter
+
+The `common_grams` token filter is designed to make phrase queries with
+stopwords more efficient.
+It is similar to the `shingles` token filter (see
+<>), which creates _bigrams_ out of every pair of adjacent words. It
+is most easily explained by example.
+
+The `common_grams` token filter produces different output depending on whether
+`query_mode` is set to `false` (for indexing) or to `true` (for searching), so
+we have to create two separate analyzers:
+
+[source,json]
+-------------------------------
+PUT /my_index
+{
+    "settings": {
+        "analysis": {
+            "filter": {
+                "index_filter": { <1>
+                    "type": "common_grams",
+                    "common_words": "_english_" <2>
+                },
+                "search_filter": { <1>
+                    "type": "common_grams",
+                    "common_words": "_english_", <2>
+                    "query_mode": true
+                }
+            },
+            "analyzer": {
+                "index_grams": { <3>
+                    "tokenizer": "standard",
+                    "filter": [ "lowercase", "index_filter" ]
+                },
+                "search_grams": { <3>
+                    "tokenizer": "standard",
+                    "filter": [ "lowercase", "search_filter" ]
+                }
+            }
+        }
+    }
+}
+-------------------------------
+
+<1> First we create two token filters based on the `common_grams` token
+    filter: `index_filter` for index time (with `query_mode` set to the
+    default `false`), and `search_filter` for query time (with `query_mode`
+    set to `true`).
+
+<2> The `common_words` parameter accepts the same options as the `stopwords`
+    parameter (see <>). The filter also
+    accepts a `common_words_path` parameter which allows you to maintain the
+    common words list in a file.
+
+<3> Then we use each filter to create an analyzer for index time and another
+    for query time.
+
+With our custom analyzers in place, we can create a field which will use the
+`index_grams` analyzer at index time:
+
+[source,json]
+-------------------------------
+PUT /my_index/_mapping/my_type
+{
+    "properties": {
+        "text": {
+            "type": "string",
+            "index_analyzer": "index_grams", <1>
+            "search_analyzer": "standard" <1>
+        }
+    }
+}
+-------------------------------
+<1> The `text` field uses the `index_grams` analyzer at index time, but
+    defaults to using the `standard` analyzer at search time, for reasons we
+    will explain below.
+
+==== At index time
+
+If we were to analyze the phrase ``The quick and brown fox'' with shingles, it
+would produce these terms:
+
+[source,text]
+-------------------------------
+Pos 1: the_quick
+Pos 2: quick_and
+Pos 3: and_brown
+Pos 4: brown_fox
+-------------------------------
+
+Our new `index_grams` analyzer produces the following terms instead:
+
+[source,text]
+-------------------------------
+Pos 1: the, the_quick
+Pos 2: quick, quick_and
+Pos 3: and, and_brown
+Pos 4: brown
+Pos 5: fox
+-------------------------------
+
+All terms are output as unigrams -- `the`, `quick`, etc. -- but if a word is a
+common word or is followed by a common word, then it also outputs a bigram in
+the same position as the unigram -- `the_quick`, `quick_and`, `and_brown`.
+
+==== Unigram queries
+
+Because the index contains unigrams, the field can be queried using the same
+techniques that we have used for any other field, for example:
+
+[source,json]
+-------------------------------
+GET /my_index/_search
+{
+    "query": {
+        "match": {
+            "text": {
+                "query": "the quick and brown fox",
+                "cutoff_frequency": 0.01
+            }
+        }
+    }
+}
+-------------------------------
+
+The above query string is analyzed by the `search_analyzer` configured for the
+`text` field -- the `standard` analyzer in this example -- to produce the
+terms: `the`, `quick`, `and`, `brown`, `fox`.
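+
+If you want to check exactly which terms have made it into the index, you can
+run some text through the `index_grams` analyzer with the `analyze` API -- a
+quick sanity check, assuming the `my_index` settings shown above:
+
+[source,json]
+-------------------------------
+GET /my_index/_analyze?analyzer=index_grams <1>
+The quick and brown fox
+-------------------------------
+<1> The response should contain the same unigrams and bigrams that we listed
+    above: `the`, `the_quick`, `quick`, `quick_and`, and so on.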
+ +Because the index for the `text` field contains the same unigrams as produced +by the `standard` analyzer, search functions like it would for any normal +field. + +==== Bigram phrase queries + +However, when we come to do phrase queries, we can use the specialized +`search_grams` analyzer to make the process much more efficient: + +[source,json] +------------------------------- +GET /my_index/_search +{ + "query": { + "match_phrase": { + "text": { + "query": "The quick and brown fox", + "analyzer": "search_grams" <1> + } + } + } +} + +------------------------------- +<1> For phrase queries, we override the default `search_analyzer` and use the + `search_grams` analyzer instead. + +The `search_grams` analyzer would produce the following terms: + +[source,text] +------------------------------- +Pos 1: the_quick +Pos 2: quick_and +Pos 3: and_brown +Pos 4: brown +Pos 5: fox +------------------------------- + +It has stripped out all of the common word unigrams, leaving the common word +bigrams and the low frequency unigrams. Bigrams like `the_quick` are much +less common than the single term `the`. This has two advantages: + +* The positions data for `the_quick` is much smaller than for `the`, so it is + faster to read from disk and has less of an impact on the file system cache. + +* The term `the_quick` is much less common than `the`, so it drastically + decreases the number of documents that have to be examined. + +==== Two word phrases + +There is one further optimization. By far the majority of phrase queries +consist of only two words. If one of those words happens to be a common word, +such as: + +[source,json] +------------------------------- +GET /my_index/_search +{ + "query": { + "match_phrase": { + "text": { + "query": "The quick", + "analyzer": "search_grams" + } + } + } +} +------------------------------- + +then the `search_grams` analyzer outputs a single token: `the_quick`. This +transforms what originally could have been an expensive phrase query for `the` +and `quick` into a very efficient single term lookup. diff --git a/240_Stopwords/70_Relevance.asciidoc b/240_Stopwords/70_Relevance.asciidoc new file mode 100644 index 000000000..9d7b3f901 --- /dev/null +++ b/240_Stopwords/70_Relevance.asciidoc @@ -0,0 +1,72 @@ +[[stopwords-relavance]] +=== Stopwords and relevance + +The last topic to cover before moving on from stopwords is that of relevance. +Leaving stopwords in your index can potentially make the relevance calculation +less accurate, especially if your documents are very long. + +To understand why, consider how the relevance `_score` is calculated with +TF/IDF. The weight of a term depends on the interplay of these two factors: + +Term frequency:: + + The more often a word appears in the same document, the *more* relevant the + document. + +Inverse document frequency:: + + The more documents a word appears in, the *less* relevant the word. + +Usually, common terms have a very low weight because of their inverse document +frequency. However, globally common words are also locally common. The word +`the` may occur many times in the same field -- the longer the document, the +more often it appears. With TF/IDF, this increase in term frequency can +offset the low inverse document frequency, giving more weight to common words +than they deserve. + +Removing stopwords essentially solves this issue, but that comes with reduced +searchability. It is a poor solution to the problem. + +While TF/IDF is the default similarity algorithm used in Lucene, it is not the +only one. 
Lucene (and Elasticsearch) supports a number of other similarity algorithms, one of which is particularly pertinent to the problem described above: BM25.
+
+The BM25 model is similar to TF/IDF, but it also normalizes term frequencies.
+What this means is that, unlike TF/IDF, the weight contributed by term
+frequency has a ceiling -- the more often a word appears in a document, the
+more relevant that term is, but only up to a point. This essentially solves
+the relevance issue with common words.
+
+It is possible to change the default similarity algorithm globally, by adding
+the following to the `config/elasticsearch.yml` file:
+
+[source,yaml]
+---------------------------
+index.similarity.default.type: BM25
+---------------------------
+
+But similarity can also be configured on a field-by-field basis, when the
+field is created. For instance:
+
+[source,json]
+---------------------------
+PUT /my_index
+{
+    "mappings": {
+        "my_type": {
+            "properties": {
+                "text": {
+                    "type": "string",
+                    "similarity": "BM25"
+                }
+            }
+        }
+    }
+}
+---------------------------
+
+TF/IDF is a tried and tested algorithm that has long served search well. BM25
+has its roots in probabilistic retrieval research dating back to the 1970s and
+has been refined considerably since then. It is considered to be a
+state-of-the-art ranking function and is worth trying out if you find that
+stopwords are adversely affecting the quality of your results.