International geocoding #9

Open
ellenhp opened this issue Feb 19, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

Owner

ellenhp commented Feb 19, 2024

I think going parserless will get us most of the way to reasonable international performance, but we'll need to ditch permute_road and friends. I'm thinking it makes the most sense to use the following algorithm for that, for each POI:

  1. Use OpenCage address formatting to create a text address for the POI. This happens after admin area population, which is important for libpostal language detection ("c/ de villarroel, barcelona" and "c/ de villarroel" have different expansions; the former uses Catalan in addition to Spanish).
  2. Call into libpostal's expand endpoint and collect the results
  3. For each expansion, perform the following steps:
  4. Call into libpostal's parse endpoint and collect the parsed tokens
  5. For each parsed token, substitute all possible abbreviations across all languages. For example, "Saint Louis" will become ["St Louis", "Saint Louis"] because of the English personal_titles substitution dictionary, and "carrer de villarroel" will become ["c/ de villarroel", "carrer de villarroel"].
  6. Index each of the substituted tokens
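
Step 5 could look roughly like the sketch below. It is Python for illustration only and does not call libpostal: `SUBSTITUTIONS`, `build_lookup`, and `token_variants` are hypothetical names, and the three-entry table is a stand-in for the real dictionaries, not their actual contents or format.

```python
from itertools import product

# Hypothetical, tiny stand-in for libpostal's substitution dictionaries.
# Each group lists a canonical word together with its known abbreviations.
SUBSTITUTIONS = [
    ["saint", "st"],
    ["carrer", "c/"],
    ["avenue", "ave"],
]


def build_lookup(groups):
    """Map every form in a group to the full set of forms in that group."""
    lookup = {}
    for group in groups:
        for form in group:
            lookup.setdefault(form, set()).update(group)
    return lookup


def token_variants(text, lookup):
    """Return every spelling of `text` obtained by swapping words for their alternatives."""
    # One list of alternatives per word; words without an entry keep only themselves.
    options = [sorted(lookup.get(word, {word})) for word in text.lower().split()]
    # The cartesian product enumerates every combination of alternatives.
    return sorted(" ".join(choice) for choice in product(*options))


LOOKUP = build_lookup(SUBSTITUTIONS)
print(token_variants("Saint Louis", LOOKUP))
print(token_variants("carrer de villarroel", LOOKUP))
```

Each returned string would then be indexed as in step 6. The number of variants grows multiplicatively with the number of substitutable words, so a real implementation would probably want a cap.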

I think this is a reasonable way to do international permute_street, but it means we'll need to perform numex at query time, because e.g. "third ave" will become "3rd ave" during indexing, and we need to match that behavior if we want to match the correct documents.
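
To illustrate why the query side has to apply the same numex step, here is a minimal Python sketch. The `ORDINALS` table and `normalize_numex` are hypothetical; libpostal's real numex rules are per-language and much more extensive.

```python
# Hypothetical subset of a numex table, for illustration only.
ORDINALS = {
    "first": "1st",
    "second": "2nd",
    "third": "3rd",
    "fourth": "4th",
    "fifth": "5th",
}


def normalize_numex(text):
    """Rewrite spelled-out ordinals to digit form, leaving other words untouched."""
    return " ".join(ORDINALS.get(word, word) for word in text.lower().split())


# Index time: the document is stored in its normalized form.
indexed_term = normalize_numex("Third Ave")

# Query time: the raw query only matches if it goes through the same function.
raw_query = "third ave"
print(raw_query == indexed_term)
print(normalize_numex(raw_query) == indexed_term)
```

Sharing one normalization function between the indexer and the query path is the simplest way to keep the two from drifting apart.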

We'll need to modify the dictionary substitution behavior to add the empty string for each street_type substitution for languages where it's appropriate; otherwise "fremont ave" won't match "fremont ave n".

TODO: strasse suffixes and whatnot will need to be handled too. I might look into the libpostal codebase to see if there's anything we can reuse there.

@ellenhp ellenhp added the enhancement New feature or request label Feb 19, 2024
Owner Author

ellenhp commented Feb 21, 2024

Okay, I ended up doing this very differently than expected. Namely, I rely on OpenStreetMap to use the unabbreviated street names, then detect the language with lingua-rs (I should use WOF official/spoken languages instead) and do the substitutions based on that language. There are some issues: anecdotally it seems to work OK for Spanish and Catalan street names, but I haven't had much luck with French or German street names.


do-me commented Feb 11, 2025

Where did you get all those language-specific substitutions from?

Owner Author

ellenhp commented Feb 11, 2025

The dictionaries are from libpostal. Looks like they're unattributed, which is my bad. I'll fix it real quick.

Owner Author

ellenhp commented Feb 11, 2025

Fixed in 237fedd

Owner Author

ellenhp commented Feb 11, 2025

At some point I did start using WOF spoken languages for abbreviations, by the way! So no more lingua-rs misclassifications. These were especially common when spot checking in California, where many of the streets and places have Spanish names. This led to some interesting cases where street names weren't properly abbreviated in the Airmail index, and a search for these streets would yield no results or suboptimal results if the search query used their abbreviated names.
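
A rough Python sketch of the difference, assuming the language list is supplied per place (for example from WOF spoken languages) rather than guessed from the street name's text. `DICTIONARIES` and `abbreviate` are hypothetical, the entries are illustrative rather than libpostal's, and the `["es"]` call is only a guess at what a text-based detector might return for a Spanish-looking name.

```python
# Hypothetical per-language dictionaries (canonical word -> abbreviation).
DICTIONARIES = {
    "en": {"street": "st", "avenue": "ave", "boulevard": "blvd"},
    "es": {"calle": "c", "avenida": "av"},
    "ca": {"carrer": "c/", "avinguda": "av"},
}


def abbreviate(name, languages):
    """Return the lowercased name plus one abbreviated variant per language that changes it."""
    lowered = name.lower()
    variants = {lowered}
    for lang in languages:
        table = DICTIONARIES.get(lang, {})
        variants.add(" ".join(table.get(word, word) for word in lowered.split()))
    return sorted(variants)


# Languages taken from the place: English is included, so "boulevard" is abbreviated.
print(abbreviate("Santa Monica Boulevard", ["en", "es"]))

# Languages guessed from the name alone: if only Spanish is returned, nothing is abbreviated.
print(abbreviate("Santa Monica Boulevard", ["es"]))
```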


do-me commented Feb 11, 2025

That's great!
Not sure whether it's related, but one thing I noted for German is that e.g. "johannes-kerscht-str" is not found while "johannes-kerscht-straße" or "strasse" works. Not sure whether the stemmer kicks in here or the libpostal abbreviations are not working 100%?

Owner Author

ellenhp commented Feb 11, 2025

It's been a while since I've put development time into this project, but I suspect that it tokenizes by spaces and only applies substitutions if there's a full token match. Not ideal, especially for French and German, and probably others too.
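
If that suspicion is right, the behavior would look something like `normalize_full_token` below. This is a Python sketch for illustration, not the project's actual code: `STREET_TYPES` and both functions are hypothetical, and `normalize_subtoken` is just one possible workaround.

```python
# Hypothetical German street-type dictionary (variant -> canonical form).
STREET_TYPES = {"str": "straße", "str.": "straße", "strasse": "straße"}


def normalize_full_token(query):
    """Whitespace tokenization; substitute only when a whole token matches."""
    return " ".join(STREET_TYPES.get(tok, tok) for tok in query.lower().split())


def normalize_subtoken(query):
    """Additionally split each token on hyphens so a hyphenated street type is matched."""
    out = []
    for tok in query.lower().split():
        parts = [STREET_TYPES.get(part, part) for part in tok.split("-")]
        out.append("-".join(parts))
    return " ".join(out)


query = "johannes-kerscht-str"
print(normalize_full_token(query))  # unchanged: the whole token is not in the dictionary
print(normalize_subtoken(query))    # the trailing "str" is expanded
```

Splitting on hyphens would still miss concatenated compounds such as "hauptstr", which would need suffix matching on top of this.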
