Use match instead of match_phrase query for autocomplete #1432
Conversation
Here's an example of a downside for this PR: when querying for 85 GREY mountain drive, sedona, az, some of the incorrect matches even score higher than the desired result. One way we might be able to fix this is some post-processing after queries in the API that would attempt to prune out clearly non-matching records. This could actually be useful in lots of different cases.
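A hypothetical sketch of that post-processing idea (the function name, thresholds, and logic here are invented for illustration and are not part of this PR): after Elasticsearch returns hits, compare each hit's name tokens against the input and drop records that overlap too little.

```js
// Hypothetical post-query pruning sketch; `hits` is an array of Elasticsearch
// hits, and the 0.5 overlap threshold is an arbitrary illustration.
function pruneNonMatches(inputText, hits, minOverlap = 0.5) {
  const inputTokens = new Set(inputText.toLowerCase().split(/[\s,]+/));
  return hits.filter((hit) => {
    const nameTokens = hit._source.name.default.toLowerCase().split(/\s+/);
    const matched = nameTokens.filter((t) => inputTokens.has(t)).length;
    // keep records where at least half the name tokens appear in the input
    return matched / nameTokens.length >= minOverlap;
  });
}
```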
Force-pushed from af5c822 to 946061e
`match_phrase` is now a should query

Force-pushed from 946061e to 095ab2f
I continue to do bits of analysis on the impact of this PR. Most queries are noisier with this, which is expected. We'll want to do some analysis of the performance impact of the increased number of documents that will match and have to be scored. It might be minor, or there might be some new cases that are problematic.

Another minor but interesting example of changes from this PR: it will effectively increase the relative amount of scoring that comes from text matches, rather than popularity or population. Here's an example of an autocomplete query for Kansas City. Before, the more popular (and populated) city of Kansas City, MO came first. As it happens, due to a bug, it no longer does with this PR.

While this particular issue will be fixed, we should keep in mind that by adding another scoring query based on text match, we are essentially "devaluing" the focus point/population/popularity component of scoring. Over in #1205 we discuss the idea of moving from an "additive" scoring model (text + focus point + population + popularity) to a "multiplicative" scoring model (text * focus point * population * popularity), which might help avoid this problem someday.

Just for posterity, here's the change in scoring that resulted in the behavior I saw: before, the population component accounted for 55% of the highest ranking document's score, whereas after, it's only 39%.
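To make the additive-vs-multiplicative point concrete, here's a toy calculation. All the component scores are invented for illustration and have nothing to do with the real scoring functions:

```js
// Toy illustration of why adding another text-scoring clause "devalues" the
// population component under additive scoring. All numbers are invented.
const popular = { text: 1.0, population: 1.0 }; // e.g. Kansas City, MO
const other   = { text: 1.3, population: 0.4 }; // a better pure-text match

// Additive: each extra text clause shrinks population's share of the total.
const additive = (d, textClauses) => d.text * textClauses + d.population;
console.log(additive(popular, 1), additive(other, 1)); // 2.0 vs 1.7 -> popular wins
console.log(additive(popular, 2), additive(other, 2)); // 3.0 vs 3.0 -> tie at best

// Multiplicative: population scales the text score instead of competing with
// it, so extra text clauses don't erode its influence as quickly.
const multiplicative = (d, textClauses) => d.text * textClauses * d.population;
console.log(multiplicative(popular, 2), multiplicative(other, 2)); // 2.0 vs 1.04
```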
This PR has already somehow grown quite old! Since then we've had the incredible #1296 come in to start using multi-match queries for autocomplete to improve support for multiple languages. I think this PR will require some reworking to support that. What I'd actually like to do is take the pattern explored in this PR (where there is a loose `match` query in the `must` clause and a stricter `match_phrase` boost in the `should` clause) and rework it to fit those multi-match queries.
```diff
@@ -20,6 +20,7 @@ module.exports = _.merge({}, peliasQuery.defaults, {
   'ngram:field': 'name.default',
   'ngram:boost': 100,
   'ngram:cutoff_frequency': 0.01,
+  'ngram:minimum_should_match': '1<-1 3<-25%',
```
This setting was used just because it's what we've used before; we can, and potentially should, reconsider whether it's the best option. Also, as we mentioned, it might make sense to use different values in different conditions, for example if the parser believes it's a venue query vs. a regular autocomplete query.

The explanation of what it does and why is contained in #1333
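For reference, here's a sketch of how `'1<-1 3<-25%'` reads under the standard Elasticsearch `minimum_should_match` combination syntax (my paraphrase; #1333 has the full discussion). The field name follows the diff above; the query text is an arbitrary example:

```js
// Sketch of a match query using this setting.
const query = {
  match: {
    'name.default': {
      query: '3929 saint marks avenue',
      // '1<-1 3<-25%' reads: for 2-3 tokens, at most 1 token may be missing;
      // for 4 or more tokens, at most 25% (rounded down) may be missing.
      minimum_should_match: '1<-1 3<-25%'
    }
  }
};
// For the 4-token example above, 25% of 4 = 1, so documents matching any
// 3 of the 4 tokens still match.
```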
This change overall seems negative in my testing. It seems to REALLY want to make things into intersections. Some examples:

/v1/autocomplete?lang=&size=10&focus.point.lat=37.78657&focus.point.lon=-122.42001999999998&text=Smoke Vape & Beyond
/v1/autocomplete?lang=&size=10&focus.point.lat=37.85597&focus.point.lon=-122.28915&text=Tenth Street & Grayson Street
/v1/autocomplete?lang=&size=10&focus.point.lat=37.76429&focus.point.lon=-122.46603&text=9th+%26+Irving
/v1/autocomplete?lang=&size=10&focus.point.lat=37.77306&focus.point.lon=-122.4734&text=15th Avenue & Fulton Street

Out of 500 venue & intersection queries I saw ~60 diffs, with no wins in the first 20.
This is a small but meaningful change I've been meaning to make to our autocomplete queries for a while.

The primary `must` query in many of our autocomplete queries is a `match_phrase` query. This ends up being a little too strict, since `match_phrase` requires that every token be present. This PR replaces that query with a `match` query, and uses `minimum_should_match` to allow some tokens to be missing when there are 3 or more tokens in the input. The original `match_phrase` query is moved to the `should` clause to provide an additional scoring boost when all tokens match and are in the correct order (a sketch of the resulting query shape follows).
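This is not the exact query the API generates (field names, boosts, and analyzers are simplified away); it just shows the loose `match` in `must` with the `match_phrase` boost in `should`:

```js
// Simplified sketch of the new autocomplete query shape; real queries are
// built by pelias-query and include more fields, boosts, and filters.
const query = {
  bool: {
    must: [
      {
        match: {
          'name.default': {
            query: '3929 saint marks avenue',
            // allow some tokens to be missing on longer inputs
            minimum_should_match: '1<-1 3<-25%'
          }
        }
      }
    ],
    should: [
      {
        // extra score when all tokens match, in order
        match_phrase: {
          'name.default': '3929 saint marks avenue'
        }
      }
    ]
  }
};
```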
The main effect of this PR is increasing the number of queries that will return results. For example: /v1/autocomplete?text=3929 saint Marks Avenue, Niagara Falls, ON, Canada. Because we only contract abbreviations when indexing, this query (which includes the word `saint` when looking for an address that uses the abbreviation `st` in our data) was finding no results. Now it not only returns the correct result, but several variations with typos or incorrect words also return the correct result.

Some examples that now all return the correct result, at least in the top 10:

/v1/autocomplete?text=3929 st mark Avenue, Niagara Falls, ON, Canada
/v1/autocomplete?text=3929 st foo Avenue, Niagara Falls, ON, Canada
/v1/autocomplete?focus.point.lat=43.103698780971875&focus.point.lon=-79.02009080158956&text=3929 st Mark Avenue
While I didn't see any major regressions comparing our autocomplete test suites for this PR, we should do another quick round of testing before merging this.
This sort of change is a prerequisite for substantial work on any sort of fuzzy matching or typo correction, since the Elasticsearch `fuzziness` options are only possible on a `match` query.
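To illustrate why (a hedged sketch; this PR does not add any fuzziness itself): `match` accepts a `fuzziness` option while `match_phrase` does not, so a future change could extend the new `must` query along these lines:

```js
// Hypothetical future extension, not part of this PR: fuzziness is only
// supported on `match` queries, which is what this PR makes possible.
const fuzzyMatch = {
  match: {
    'name.default': {
      query: '3929 sant marks avenue', // note the typo in "saint"
      fuzziness: 'AUTO',               // tolerate small edit-distance typos
      minimum_should_match: '1<-1 3<-25%'
    }
  }
};
```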