
Use match instead of match_phrase query for autocomplete #1432

Open · wants to merge 1 commit into master
Conversation

@orangejulius (Member) commented May 12, 2020

This is a small but meaningful change I've been meaning to make to our autocomplete queries for a while.

The primary `must` query in many of our autocomplete queries is a `match_phrase` query. This ends up being a little too strict, since `match_phrase` requires that every token be present, in order.

This PR replaces that query with a `match` query, and uses `minimum_should_match` to allow some tokens to be missing when there are 3 or more tokens in the input. The original `match_phrase` query is moved to the `should` clause to provide an additional scoring boost when all tokens match in the correct order.
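The restructured query described above can be sketched as follows. This is a minimal illustration, not the actual Pelias query builder output; the field name `name.default` is taken from the config shown later in this PR, but everything else (function name, overall shape) is assumed for illustration:

```javascript
// Sketch of the restructured autocomplete query: a loose `match` in `must`
// (allowing some tokens to be missing per minimum_should_match), with the
// old `match_phrase` demoted to `should` as a scoring boost.
function buildAutocompleteQuery(text) {
  return {
    query: {
      bool: {
        must: [{
          match: {
            'name.default': {
              query: text,
              // allow 1 missing token for 2-3 tokens, ~25% missing for more
              minimum_should_match: '1<-1 3<-25%'
            }
          }
        }],
        should: [{
          // extra boost when all tokens match in the correct order
          match_phrase: { 'name.default': { query: text } }
        }]
      }
    }
  };
}
```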

The main effect of this PR is increasing the number of queries that will return results.

For example: /v1/autocomplete?text=3929 saint Marks Avenue, Niagara Falls, ON, Canada.
[screenshot: results for this query]

Because we only contract abbreviations when indexing, this query (which includes the word `saint` when looking for an address that uses the abbreviation `st` in our data) returned no results. Now it not only returns the correct result, but several variations with typos or incorrect words also return the correct result.

Some examples that now all return the correct result at least in the top 10:

/v1/autocomplete?text=3929 st mark Avenue, Niagara Falls, ON, Canada
/v1/autocomplete?text=3929 st foo Avenue, Niagara Falls, ON, Canada
/v1/autocomplete?focus.point.lat=43.103698780971875&focus.point.lon=-79.02009080158956&text=3929 st Mark Avenue

While I didn't see any major regressions comparing our autocomplete test suites for this PR, we should do another quick round of testing before merging this.

This sort of change is a prerequisite for any substantial work on fuzzy matching or typo correction, since the Elasticsearch fuzziness options are only available on a `match` query.
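To illustrate why this matters: `fuzziness` is a parameter of the `match` query, whereas `match_phrase` does not accept it. A hedged sketch (field name and values are illustrative):

```javascript
// `fuzziness` is accepted by `match` but not by `match_phrase`, which is
// why moving the primary clause to `match` unblocks typo-tolerance work.
const fuzzyMatch = {
  match: {
    'name.default': {
      query: 'grey mountain drive',
      fuzziness: 'AUTO',                    // tolerate small edit-distance typos
      minimum_should_match: '1<-1 3<-25%'   // still allow a missing token
    }
  }
};
```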

@orangejulius requested a review from missinglink on May 12, 2020
@orangejulius (Member, Author) commented Jun 1, 2020

Here's an example of a downside of this PR:

When querying for 85 GREY mountain drive, sedona, az (note: the input is grey, the street name is gray), we see several not-quite-matching results:
[screenshot: not-quite-matching results]

As shown here, some of the incorrect matches even score higher than the desired result.

One way we might be able to fix this is post-processing in the API after queries run, to prune out clearly non-matching records. That could actually be useful in lots of other cases too.

@orangejulius (Member, Author) commented Jun 1, 2020

I just realized I actually had a misspelling in my example above, which happened to show off both some positives and some negatives of this PR. It allowed us to return the correct result even with a slight difference between the input and the address record (grey vs gray).

Spelling the street name as it is in the data, we see the more common way this would likely manifest: lots of non-matching housenumbers, and a non-matching street coming in after the expected result:
[screenshot: results with the street name spelled as in the data]

`match_phrase` is now a `should` query
@orangejulius force-pushed the autocomplete-match-and-match_phrase branch from 946061e to 095ab2f on June 2, 2020
@missinglink (Member) commented Jun 4, 2020

I noticed that queries for ordinal streets (such as 1st st) can become noisier with this PR:

[screenshot: noisier results for an ordinal street query]

The last 3 results are undesirable because they aren't what the user requested:
i.e. 5 E 10th St != 5 E 5th St.
Also, 5 East Street is not correct for this input either.

@orangejulius (Member, Author) commented Jun 9, 2020

I continue to do bits of analysis on the impact of this PR. Most queries are noisier with this change, which is expected. We'll want to analyze the performance impact of the increased number of documents that will match and have to be scored. It might be minor, or there might be some new cases that are problematic.

Another minor but interesting example of changes from this PR: it will effectively increase the relative amount of scoring that comes from text matches, rather than popularity or population.

Here's an example of an autocomplete query for Kansas City. Before, the more popular (and populated) city of Kansas City, MO came first. As it happens, due to a bug, the Kansas City, KS record had a slight boost to the text match. The boost was amplified by this PR enough to swap the results.
[screenshot: Kansas City, KS ranked above Kansas City, MO]

While this particular issue will be fixed, we should keep in mind that by adding another scoring query based on text match, we are essentially "devaluing" the focus point/population/popularity component of scoring. Over in #1205 we discuss the idea of moving from an "additive" scoring model (text + focus point + population + popularity) to a "multiplicative" scoring model (text * focus point * population * popularity), which might help avoid this problem someday.
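The dilution effect can be shown with toy arithmetic. The component values below are invented purely to reproduce the 55% → 39% population-share change reported in the next paragraph; they are not the actual scores from the screenshots:

```javascript
// Toy illustration (invented numbers): under additive scoring, adding
// another text-based clause shrinks the population component's share of
// the total score, even though its absolute value is unchanged.
function share(components, key) {
  const total = Object.values(components).reduce((a, b) => a + b, 0);
  return components[key] / total;
}

const before = { text: 4, population: 5 };                    // 5/9   ≈ 55%
const after  = { text: 4, phraseBoost: 3.7, population: 5 };  // 5/12.7 ≈ 39%
```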

Just for posterity, here's the change in scoring that resulted in the behavior I saw. Before, the population component accounted for 55% of the highest ranking document's score, whereas after, it's only 39%.
[screenshot: score breakdown before and after]

@orangejulius (Member, Author) commented
This PR has already somehow grown quite old!

Since then we've had the incredible #1296 come in to start using multi-match queries for autocomplete to improve support for multiple languages. I think this PR will require some reworking to support that.

What I'd actually like to do is take the pattern explored in this PR (a loose `match` query in a `must`, and a strict `match_phrase` query in a `should`), and create a view for it in pelias/query, so that we can easily reuse it throughout Pelias. We might want both a `match` and a `multi_match` variant.
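A hypothetical sketch of what such a pelias/query view could look like. The `vs` variable store and its `.var()` accessor follow pelias/query conventions, but the view name and variable names here are assumptions, not the library's actual API:

```javascript
// Hypothetical reusable view: loose `match` in `must`, strict
// `match_phrase` in `should`. Variable names are illustrative.
function looseMatchStrictPhraseView(vs) {
  const field = vs.var('input:field');
  const text = vs.var('input:text');
  return {
    bool: {
      must: [{
        match: {
          [field]: {
            query: text,
            minimum_should_match: vs.var('input:minimum_should_match')
          }
        }
      }],
      should: [{
        match_phrase: { [field]: { query: text } }
      }]
    }
  };
}
```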

```diff
@@ -20,6 +20,7 @@ module.exports = _.merge({}, peliasQuery.defaults, {
   'ngram:field': 'name.default',
   'ngram:boost': 100,
   'ngram:cutoff_frequency': 0.01,
+  'ngram:minimum_should_match': '1<-1 3<-25%',
```
@orangejulius (Member, Author) commented on this change, Aug 13, 2020

This setting was used just because it's what we've used before; we can and should reconsider whether it's the best option. Also, as we mentioned, it might make sense to use different values in different conditions, for example if the parser believes the input is a venue query rather than a general autocomplete query.

The explanation of what it does and why is contained in #1333
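For readers unfamiliar with the `'1<-1 3<-25%'` syntax, here is an approximate sketch of its semantics: with one token, all tokens are required; with 2-3 tokens, one may be missing; with 4 or more, up to 25% (rounded down) may be missing. This is a simplified restatement of the Elasticsearch `minimum_should_match` rules, not its actual implementation:

```javascript
// Approximate semantics of minimum_should_match '1<-1 3<-25%':
// each condition 'N<V' applies when the token count exceeds N, and a
// negative value means that many tokens (or that percentage) may be missing.
function requiredTokens(tokenCount) {
  if (tokenCount <= 1) return tokenCount;             // no condition applies
  if (tokenCount <= 3) return tokenCount - 1;         // '1<-1': one may be missing
  return tokenCount - Math.floor(tokenCount * 0.25);  // '3<-25%': 25% may be missing
}
```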

@blackmad (Contributor) commented
Overall this change seems negative in my testing. It seems to REALLY want to turn things into intersections.

Some examples:

/v1/autocomplete?lang=&size=10&focus.point.lat=37.78657&focus.point.lon=-122.42001999999998&text=Smoke Vape & Beyond
--> goes from perfect venue match to intersection

/v1/autocomplete?lang=&size=10&focus.point.lat=37.85597&focus.point.lon=-122.28915&text=Tenth Street & Grayson Street
--> goes from perfect intersection match to "label": "East Grayson Street & Grayson Street, Mexia, TX, USA"

/v1/autocomplete?lang=&size=10&focus.point.lat=37.76429&focus.point.lon=-122.46603&text=9th+%26+Irving
--> was perfect match, now different intersection match 130km away from focus

/v1/autocomplete?lang=&size=10&focus.point.lat=37.77306&focus.point.lon=-122.4734&text=15th Avenue & Fulton Street
--> was perfect; now does out-of-order matching on the street suffixes, returning "15th Street & Fulton Street, Tell City, IN, USA", many km away

Out of 500 venue & intersection queries I saw ~60 diffs, with no wins in the first 20.
