
Use match instead of match_phrase query for autocomplete #1432

Open · wants to merge 1 commit into master
Conversation

@orangejulius (Member) commented May 12, 2020

This is a small but meaningful change I've been meaning to make to our autocomplete queries for a while.

The primary `must` query in many of our autocomplete queries is a `match_phrase` query. This ends up being a little too strict, since `match_phrase` requires that every token be present, in order.

This PR replaces that query with a `match` query, and uses `minimum_should_match` to allow some tokens to be missing when there are 3 or more tokens in the input. The original `match_phrase` query is moved to the `should` clause to provide an additional scoring boost when all tokens match in the correct order.
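The restructured query described above can be sketched as follows. This is a minimal illustration, not the actual Pelias query builder output; the field name `name.default` is taken from the config shown later in this PR, but everything else (function name, overall shape) is assumed for illustration:

```javascript
// Sketch of the restructured autocomplete query: a loose `match` in `must`
// (allowing some tokens to be missing per minimum_should_match), with the
// old `match_phrase` demoted to `should` as a scoring boost.
function buildAutocompleteQuery(text) {
  return {
    query: {
      bool: {
        must: [{
          match: {
            'name.default': {
              query: text,
              // allow 1 missing token for 2-3 tokens, ~25% missing for more
              minimum_should_match: '1<-1 3<-25%'
            }
          }
        }],
        should: [{
          // extra boost when all tokens match in the correct order
          match_phrase: { 'name.default': { query: text } }
        }]
      }
    }
  };
}
```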

The main effect of this PR is increasing the number of queries that will return results.

For example: /v1/autocomplete?text=3929 saint Marks Avenue, Niagara Falls, ON, Canada.
[screenshot: results for this query]

Because we only contract abbreviations when indexing, this query (which includes the word `saint` when looking for an address that uses the abbreviation `st` in our data) returned no results. Now it not only returns the correct result, but several variations with typos or incorrect words also return the correct result.

Some examples that now all return the correct result at least in the top 10:

/v1/autocomplete?text=3929 st mark Avenue, Niagara Falls, ON, Canada
/v1/autocomplete?text=3929 st foo Avenue, Niagara Falls, ON, Canada
/v1/autocomplete?focus.point.lat=43.103698780971875&focus.point.lon=-79.02009080158956&text=3929 st Mark Avenue

While I didn't see any major regressions comparing our autocomplete test suites for this PR, we should do another quick round of testing before merging this.

This sort of change is a prerequisite for any substantial work on fuzzy matching or typo correction, since the Elasticsearch fuzziness options are only available on a `match` query.
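To illustrate why this matters: `fuzziness` is a parameter of the `match` query, whereas `match_phrase` does not accept it. A hedged sketch (field name and values are illustrative):

```javascript
// `fuzziness` is accepted by `match` but not by `match_phrase`, which is
// why moving the primary clause to `match` unblocks typo-tolerance work.
const fuzzyMatch = {
  match: {
    'name.default': {
      query: 'grey mountain drive',
      fuzziness: 'AUTO',                    // tolerate small edit-distance typos
      minimum_should_match: '1<-1 3<-25%'   // still allow a missing token
    }
  }
};
```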

@orangejulius requested a review from missinglink on May 12, 2020
@orangejulius (Member, Author) commented Jun 1, 2020

Here's an example of a downside of this PR:

When querying for 85 GREY mountain drive, sedona, az (note: the input is grey, the street name is gray), we see several not-quite-matching results:
[screenshot: not-quite-matching results]

As shown here, some of the incorrect matches even score higher than the desired result.

One way we might be able to fix this is post-processing in the API after queries run, to prune out clearly non-matching records. That could actually be useful in lots of other cases too.

@orangejulius (Member, Author) commented Jun 1, 2020

I just realized I actually had a misspelling in my example above, which happened to show off both some positives and some negatives of this PR. It allowed us to return the correct result even with a slight difference between the input and the address record (grey vs gray).

Spelling the street name as it is in the data, we see the more common way this would likely manifest: lots of non-matching housenumbers, and a non-matching street coming in after the expected result:
[screenshot: results with the street name spelled as in the data]

`match_phrase` is now a `should` query
@orangejulius force-pushed the autocomplete-match-and-match_phrase branch from 946061e to 095ab2f on June 2, 2020
@missinglink (Member) commented Jun 4, 2020

I noticed that queries for ordinal streets (such as 1st st) can become noisier with this PR:

[screenshot: noisier results for an ordinal street query]

The last 3 results are undesirable because they aren't what the user requested:
i.e. 5 E 10th St != 5 E 5th St.
Also, 5 East Street is not correct for this input either.

@orangejulius (Member, Author) commented Jun 9, 2020

I continue to do bits of analysis on the impact of this PR. Most queries are noisier with this change, which is expected. We'll want to analyze the performance impact of the increased number of documents that will match and have to be scored. It might be minor, or there might be some new cases that are problematic.

Another minor but interesting example of changes from this PR: it will effectively increase the relative amount of scoring that comes from text matches, rather than popularity or population.

Here's an example of an autocomplete query for Kansas City. Before, the more popular (and populated) city of Kansas City, MO came first. As it happens, due to a bug, the Kansas City, KS record had a slight boost to the text match. The boost was amplified by this PR enough to swap the results.
[screenshot: Kansas City, KS ranked above Kansas City, MO]

While this particular issue will be fixed, we should keep in mind that by adding another scoring query based on text match, we are essentially "devaluing" the focus point/population/popularity component of scoring. Over in #1205 we discuss the idea of moving from an "additive" scoring model (text + focus point + population + popularity) to a "multiplicative" scoring model (text * focus point * population * popularity), which might help avoid this problem someday.
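The dilution effect can be shown with toy arithmetic. The component values below are invented purely to reproduce the 55% → 39% population-share change reported in the next paragraph; they are not the actual scores from the screenshots:

```javascript
// Toy illustration (invented numbers): under additive scoring, adding
// another text-based clause shrinks the population component's share of
// the total score, even though its absolute value is unchanged.
function share(components, key) {
  const total = Object.values(components).reduce((a, b) => a + b, 0);
  return components[key] / total;
}

const before = { text: 4, population: 5 };                    // 5/9   ≈ 55%
const after  = { text: 4, phraseBoost: 3.7, population: 5 };  // 5/12.7 ≈ 39%
```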

Just for posterity, here's the change in scoring that resulted in the behavior I saw. Before, the population component accounted for 55% of the highest ranking document's score, whereas after, it's only 39%.
[screenshot: score breakdown before and after]

@orangejulius (Member, Author) commented
This PR has already somehow grown quite old!

Since then we've had the incredible #1296 come in to start using multi-match queries for autocomplete to improve support for multiple languages. I think this PR will require some reworking to support that.

What I'd actually like to do is take the pattern explored in this PR (a loose `match` query in a `must`, and a strict `match_phrase` query in a `should`), and create a view for it in pelias/query, so that we can easily reuse it throughout Pelias. We might want both a `match` and a `multi_match` variant.
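A hypothetical sketch of what such a pelias/query view could look like. The `vs` variable store and its `.var()` accessor follow pelias/query conventions, but the view name and variable names here are assumptions, not the library's actual API:

```javascript
// Hypothetical reusable view: loose `match` in `must`, strict
// `match_phrase` in `should`. Variable names are illustrative.
function looseMatchStrictPhraseView(vs) {
  const field = vs.var('input:field');
  const text = vs.var('input:text');
  return {
    bool: {
      must: [{
        match: {
          [field]: {
            query: text,
            minimum_should_match: vs.var('input:minimum_should_match')
          }
        }
      }],
      should: [{
        match_phrase: { [field]: { query: text } }
      }]
    }
  };
}
```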

```diff
@@ -20,6 +20,7 @@ module.exports = _.merge({}, peliasQuery.defaults, {
   'ngram:field': 'name.default',
   'ngram:boost': 100,
   'ngram:cutoff_frequency': 0.01,
+  'ngram:minimum_should_match': '1<-1 3<-25%',
```
@orangejulius (Member, Author) commented on this change, Aug 13, 2020

This setting was used just because it's what we've used before; we can and should reconsider whether it's the best option. Also, as we mentioned, it might make sense to use different values in different conditions, for example if the parser believes the input is a venue query rather than a general autocomplete query.

The explanation of what it does and why is contained in #1333
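For readers unfamiliar with the `'1<-1 3<-25%'` syntax, here is an approximate sketch of its semantics: with one token, all tokens are required; with 2-3 tokens, one may be missing; with 4 or more, up to 25% (rounded down) may be missing. This is a simplified restatement of the Elasticsearch `minimum_should_match` rules, not its actual implementation:

```javascript
// Approximate semantics of minimum_should_match '1<-1 3<-25%':
// each condition 'N<V' applies when the token count exceeds N, and a
// negative value means that many tokens (or that percentage) may be missing.
function requiredTokens(tokenCount) {
  if (tokenCount <= 1) return tokenCount;             // no condition applies
  if (tokenCount <= 3) return tokenCount - 1;         // '1<-1': one may be missing
  return tokenCount - Math.floor(tokenCount * 0.25);  // '3<-25%': 25% may be missing
}
```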

@blackmad (Contributor) commented
Overall this change seems negative in my testing. It seems to REALLY want to turn things into intersections.

Some examples:

/v1/autocomplete?lang=&size=10&focus.point.lat=37.78657&focus.point.lon=-122.42001999999998&text=Smoke Vape & Beyond
--> goes from perfect venue match to intersection

/v1/autocomplete?lang=&size=10&focus.point.lat=37.85597&focus.point.lon=-122.28915&text=Tenth Street & Grayson Street
--> goes from perfect intersection match to "label": "East Grayson Street & Grayson Street, Mexia, TX, USA"

/v1/autocomplete?lang=&size=10&focus.point.lat=37.76429&focus.point.lon=-122.46603&text=9th+%26+Irving
--> was perfect match, now different intersection match 130km away from focus

/v1/autocomplete?lang=&size=10&focus.point.lat=37.77306&focus.point.lon=-122.4734&text=15th Avenue & Fulton Street
--> was perfect; now does out-of-order matching on the street suffixes, returning "15th Street & Fulton Street, Tell City, IN, USA", many km away

Out of 500 venue & intersection queries I saw ~60 diffs, with no wins in the first 20.
