Autocomplete: Multi-lang search (based on user's lang) #1296

Open
2 of 6 tasks
Joxit opened this issue May 14, 2019 · 19 comments

@Joxit
Member

Joxit commented May 14, 2019

Transcription of #127 (comment)

What is this for?

We want Pelias to answer queries written in languages other than English.
For example, a Dutch speaker searching for Parijs (Paris in Dutch) should get Parijs, Frankrijk.

What should we do?

Some use cases

| Text | Lang | Result | Status |
| --- | --- | --- | --- |
| Parijs | nl | Parijs, Frankrijk (whosonfirst:locality:101751119 Paris) | KO |
| Londre | fr | Londres, Angleterre, Royaume-Uni (whosonfirst:locality:101750367 London) | KO |
| ブラジル | ja | ブラジル (whosonfirst:country:85633009 Brazil) | OK |

cc @mihneadb Have you started working on it? I can take the task if you want 😄

@mihneadb
Contributor

mihneadb commented May 15, 2019

@Joxit I haven't, I was waiting for some clarifications in the other issue and I started working on some other stuff now. So sure, go ahead and take it, thanks a ton for doing this! :) 🥂

@Joxit
Member Author

Joxit commented May 20, 2019

It seems that, in order to validate this issue, all the importers must support the multi-lang index.

- OSM: the only importer that supports it at this time.
- WOF: will be supported with pelias/whosonfirst#446.
- Geonames: needs the alternateNamesV2 file to add multi-lang names (do we want that?).
- OpenAddresses and Polylines: unavailable.

I think the most important importer is WOF, since city/country search is the most common use case of the geocoder.

@mihneadb
Contributor

mihneadb commented May 20, 2019

@Joxit FWIW there seems to be some level of support for that already - when looking for something, passing e.g. lang=en or lang=ru yields the same name but the city name is translated.

https://pelias.github.io/compare/#/v1/autocomplete%3Flang=en&text=red%20square%20moscow
vs
https://pelias.github.io/compare/#/v1/autocomplete%3Flang=ru&text=red%20square%20moscow
(see label)

I thought that data was based on WOF.
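
For reference, a minimal sketch of how the two comparison requests above differ only in the lang parameter. The helper name autocompletePath is hypothetical; only the /v1/autocomplete path and the lang and text parameters come from the links above.

```javascript
// Hypothetical helper: builds the path and query string of an autocomplete
// request for a given search text and language code.
function autocompletePath(text, lang) {
  // URLSearchParams handles escaping; spaces are encoded as "+"
  const params = new URLSearchParams({ lang, text });
  return `/v1/autocomplete?${params.toString()}`;
}

const enPath = autocompletePath('red square moscow', 'en');
const ruPath = autocompletePath('red square moscow', 'ru');
console.log(enPath);
console.log(ruPath);
```

Both requests carry the same text, so any difference in the returned label comes from the lang value alone.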

@Joxit
Member Author

Joxit commented May 20, 2019

Yes, this is done by pelias/placeholder, a middleware that translates Elasticsearch responses for the user (using WOF ids).
This issue is about ES requests, not responses.
That means that when you use lang=ru and search for red square Москва, you will not find the correct venue (geonames:venue:6295575).

The data is present in WOF but not indexed in ES; only the default name and the English variant are currently indexed. That's why I opened pelias/whosonfirst#446 😄

@mihneadb
Contributor

Gotcha now, thanks! About that, I'm thinking we should also return Кра́сная пло́щадь if someone searches red square lang=ru, would you agree? I'm thinking this should be easier to achieve - building on what you pointed out about the middleware. I can make a PR if so.

@Joxit
Member Author

Joxit commented May 20, 2019

I think the API can return the name.{lang} field when it's available in OSM, but for Geonames it will be a bit trickier because we do not use it anywhere.
Maybe this could be added in placeholder? But then we would have conflicts with WOF data...

@mihneadb
Contributor

I was thinking about it at a higher level. Simplest seems to me to update geojsonify here: https://github.com/pelias/api/blob/master/helper/geojsonify.js#L55-L60

Instead of going for default, prioritize req.lang?
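
A rough sketch of that idea, assuming the document's name field is an object keyed by language code with a default entry. The helper pickName and the sample document are hypothetical and are not the actual geojsonify.js code:

```javascript
// Hypothetical helper: prefer the name in the requested language and
// fall back to the default name when no translation is indexed.
function pickName(source, lang) {
  const names = (source && source.name) || {};
  const localized = lang ? names[lang] : undefined;
  const chosen = localized !== undefined ? localized : names.default;
  // A name entry may hold several aliases; use the first one for display
  return Array.isArray(chosen) ? chosen[0] : chosen;
}

// Sample document for illustration only
const doc = { name: { default: 'Red Square', ru: 'Кра́сная пло́щадь' } };

console.log(pickName(doc, 'ru')); // Russian name, since name.ru exists
console.log(pickName(doc, 'fr')); // falls back to name.default
```

The fallback matters because, as noted earlier in the thread, most importers do not index per-language names yet.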

@mihneadb
Contributor

mihneadb commented Aug 1, 2019

Hi everyone! Any update on this one? LMK if I can help some way.

@missinglink
Member

Hi @mihneadb, unfortunately, I'm the blocker here. I would like to land #1287 before merging this (it's a complex change, but I'm planning on doing the final testing and merging next week).

It's really not ideal to hold back another PR, especially a community contribution, but it makes sense for us in this case because the PR I linked is a massive refactoring of how autocomplete queries are generated.

We are sometimes a little over-cautious with merging big PRs but it's our responsibility to ensure compatibility and reliability for organisations running Pelias in a production environment with user-facing traffic.

@missinglink
Member

Oh actually I thought this was another PR, but the same still applies to this one ;)

@mihneadb
Contributor

mihneadb commented Aug 1, 2019

@missinglink thanks for the transparency! Looking forward to using the new parser! :)

@slvlirnoff

@missinglink Any news on this?

@missinglink
Member

I've been sick this week but releasing the new parser is a top priority.

@bboure
Member

bboure commented Oct 5, 2020

Hi,

I came across an issue related to this today. I was looking for Edo Tokyo Museum and could not find any result. I realized that I had to search for 江戸東京博物館 in order to find it.

Any ETA for this feature?

@Joxit
Member Author

Joxit commented Oct 5, 2020

Hi @bboure, this part of the feature is already live if you are running your query with lang=en. I found a difference in ES query between the English version and the Kanji version.

{
  "constant_score": {
    "filter": {
      "multi_match": {
        "type": "cross_fields",
        "query": "Museum",
        "fields": [
          "parent.country.ngram^1",
          "parent.dependency.ngram^1",
          "parent.macroregion.ngram^1",
          "parent.region.ngram^1",
          "parent.county.ngram^1",
          "parent.localadmin.ngram^1",
          "parent.locality.ngram^1",
          "parent.borough.ngram^1",
          "parent.neighbourhood.ngram^1",
          "parent.locality_a.ngram^1",
          "parent.region_a.ngram^1",
          "parent.country_a.ngram^1",
          "name.default^1.5"
        ],
        "analyzer": "peliasQuery"
      }
    }
  }
}

In this must clause, name.en^1.5 is missing from the fields list.

What is still missing now is multi-lang support in the parent hierarchy.

@bboure
Member

bboure commented Oct 5, 2020

@Joxit Thanks for getting back to me.

Adding lang=en does not work either, though. The query does not include name.en^1.5

https://pelias.github.io/compare/#/v1/autocomplete?layers=venue&lang=en&text=Edo+Tokyo+Museum&debug=1

Am I doing something wrong?

@bboure
Member

bboure commented Oct 5, 2020

Interestingly, searching for Edo Tokyo Museum, Tokyo works.

It has to do with how the query is built.

Edo Tokyo Museum, Tokyo:

"must": [
  {
    "multi_match": {
      "type": "phrase",
      "query": "edo Tokyo Museum",
      "fields": [
        "phrase.default",
        "phrase.en"
      ],
      "analyzer": "peliasQuery",
      "boost": 1,
      "slop": 3
    }
  },
  {
    "multi_match": {
      "type": "cross_fields",
      "query": "Tokyo",
      "fields": [
        "parent.country.ngram^1",
        "parent.dependency.ngram^1",
        "parent.macroregion.ngram^1",
        "parent.region.ngram^1",
        "parent.county.ngram^1",
        "parent.localadmin.ngram^1",
        "parent.locality.ngram^1",
        "parent.borough.ngram^1",
        "parent.neighbourhood.ngram^1",
        "parent.locality_a.ngram^1",
        "parent.region_a.ngram^1",
        "parent.country_a.ngram^1",
        "name.default^1.5"
      ],
      "analyzer": "peliasAdmin"
    }
  }
],

The full text falls into the peliasQuery analyzer here, and Tokyo into peliasAdmin

Edo Tokyo Museum:

"must": [
  {
    "multi_match": {
      "type": "phrase",
      "query": "edo Tokyo",
      "fields": [
        "phrase.default",
        "phrase.en"
      ],
      "analyzer": "peliasQuery",
      "boost": 1,
      "slop": 3
    }
  },
  {
    "constant_score": {
      "filter": {
        "multi_match": {
          "type": "cross_fields",
          "query": "Museum",
          "fields": [
            "parent.country.ngram^1",
            "parent.dependency.ngram^1",
            "parent.macroregion.ngram^1",
            "parent.region.ngram^1",
            "parent.county.ngram^1",
            "parent.localadmin.ngram^1",
            "parent.locality.ngram^1",
            "parent.borough.ngram^1",
            "parent.neighbourhood.ngram^1",
            "parent.locality_a.ngram^1",
            "parent.region_a.ngram^1",
            "parent.country_a.ngram^1",
            "name.default^1.5"
          ],
          "analyzer": "peliasQuery"
        }
      }
    }
  }
],

Here, Museum is split off into a second clause, and that clause is missing name.en^1.5

@Joxit
Member Author

Joxit commented Oct 5, 2020

Don't worry, I'm working on a fix, I will publish something tonight or tomorrow.

Yes, in autocomplete, the last token can be either part of the subject (the venue) or part of the hierarchy. That's why we are using a cross_fields query over both parent.* and name.default.
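
As an illustration of the last-token clause discussed above, here is a minimal sketch that adds a language-specific name field next to name.default. The builder lastTokenClause is hypothetical and is not the actual Pelias query code; the field list and the 1.5 boost are copied from the debug output quoted in this thread, and the real fix may look different.

```javascript
// Admin fields as they appear in the debug output quoted in this thread.
const PARENT_FIELDS = [
  'parent.country.ngram^1',
  'parent.dependency.ngram^1',
  'parent.macroregion.ngram^1',
  'parent.region.ngram^1',
  'parent.county.ngram^1',
  'parent.localadmin.ngram^1',
  'parent.locality.ngram^1',
  'parent.borough.ngram^1',
  'parent.neighbourhood.ngram^1',
  'parent.locality_a.ngram^1',
  'parent.region_a.ngram^1',
  'parent.country_a.ngram^1'
];

// Hypothetical builder: the last token may belong to the venue name or to
// the admin hierarchy, so it is matched against both sets of fields.
function lastTokenClause(token, lang) {
  const fields = [...PARENT_FIELDS, 'name.default^1.5'];
  if (lang && lang !== 'default') {
    // Assumption: a name.{lang} field exists in the index for this language
    fields.push(`name.${lang}^1.5`);
  }
  return {
    constant_score: {
      filter: {
        multi_match: {
          type: 'cross_fields',
          query: token,
          fields,
          analyzer: 'peliasQuery'
        }
      }
    }
  };
}

const clause = lastTokenClause('Museum', 'en');
console.log(JSON.stringify(clause, null, 2));
```

With lang set to en, the fields list ends with name.default^1.5 and name.en^1.5, so a trailing token such as Museum can match the English venue name as well as the hierarchy.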

@bboure
Member

bboure commented Oct 5, 2020

Great, thanks!
