Disambiguation removal logic truncates remaining search text #162
I believe this is correct as-is. So in the case of "Borivali (West), Mumbai, India" we are only looking for 'Borivali'; the additional tokens which help localize it to Mumbai, India shouldn't be included in the index. The associations to Mumbai & India should be made via their hierarchical links instead, so that we understand the parent-child relationship of these tokens. Can you provide an example of a query which is currently failing due to this?
How we have it currently allows us to show a clear hierarchy of the tokens (screenshot omitted).
Because of the truncation, searching for "Portland (Oregon) USA" yields a match from Jamaica as well, and searching for "Borivali (East), MH, India" yields Borivali West as the first match. Some more examples:

"3 Store, 311-318 High Holborn, London, WC1V 7BN, UK" returns no matches, instead of the expected results (verified on a modified instance where I removed the disambiguation regex; screenshot omitted).
Similarly, "1313 1/2 Railroad Ave Bellingham WA 98225-4729" returns no matches.
"St. Judes & St. Pauls C of E (Va) Primary School, 10 Kingsbury Road, London, N1 4AZ" returns a wrong result.
"〒100-8994, 東京都 中央区 八重洲一丁目 5番3号 東京中央郵便局, Japan" returns no result.

There may be some more examples; I took some here from Falsehoods. The main problem I see is that truncating at a disambiguation character removes all trailing address information (the lineage) which is crucial in determining the location.
BTW, what I did was replace these characters with a space.
My guess is that giving more tokens to Placeholder will allow it to perform a better match, and it seems to be working well.
Oh, I see we were talking about slightly different topics. The original intention of the regex was to fix erroneous data at import time. It seems we are using the same analysis at query time that we're using at index time, so maybe you're right: we might consider making them separate analyzers so they can have different functions. Thanks for the examples, they are certainly helpful, although I don't expect us to be able to handle all the edge cases from that Falsehoods post because this library doesn't have any awareness of addresses.
Awesome. And sure, I wouldn't expect Placeholder to handle different address oddities. Placeholder should stay focused on "last line parsing".
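For illustration, the index-time/query-time split discussed above could look roughly like this. The function names and regexes here are assumptions for the sake of a sketch, not Placeholder's actual API:

```javascript
// Illustrative sketch only: separate import-time and query-time analysis.
function analyzeForIndex(input) {
  // At import time, parenthesized disambiguators could be dropped entirely;
  // the Mumbai/India context would come from hierarchical parent links.
  return input
    .replace(/\s*\([^)]*\)/g, '')
    .toLowerCase()
    .replace(/[,\s]+/g, ' ')
    .trim();
}

function analyzeForQuery(input) {
  // At query time, keep every token the user typed; only the marker
  // characters themselves are replaced with spaces.
  return input
    .replace(/[-֊־‐‑﹣()\[\]]/g, ' ')
    .toLowerCase()
    .replace(/[,\s]+/g, ' ')
    .trim();
}

console.log(analyzeForIndex('Borivali (West), Mumbai, India'));
// → 'borivali mumbai india'
console.log(analyzeForQuery('Borivali (West), Mumbai, India'));
// → 'borivali west mumbai india'
```

The point of the split is that aggressive cleanup is only safe where the hierarchy is known (at import), while user queries should keep all of their tokens.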
Tokenization has a regex that removes disambiguation markers. This can be helpful, but it currently truncates all remaining text after the first occurrence of any disambiguation character.
So:
"Portland (Oregon) USA" becomes "portland"
"Borivali (West), Mumbai, India" becomes "borivali"
etc.
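The symptom can be reproduced with a small sketch. The regex below is hypothetical and only mimics the observed behaviour; the library's actual pattern may differ:

```javascript
// Hypothetical reproduction of the truncating behaviour described above.
function truncateAtDisambiguation(input) {
  // Drop everything from the first disambiguation marker onward,
  // then lowercase, mimicking the observed output.
  return input.replace(/\s*[-֊־‐‑﹣(\[].*$/, '').toLowerCase().trim();
}

console.log(truncateAtDisambiguation('Portland (Oregon) USA'));
// → 'portland'
console.log(truncateAtDisambiguation('Borivali (West), Mumbai, India'));
// → 'borivali'
```

Note that this also explains the no-match cases above: an input like "311-318 High Holborn" loses everything after the range hyphen.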
Should it just remove the disambiguation part and leave the rest of the text as it is? (I understand that's going to be impossible where we don't have a "closing" disambiguation marker - e.g. just a simple "-")
Or is it just easier to remove the marker characters only and leave the disambiguation text as it is?
For example, change the regex to
input.replace(/[-֊־‐‑﹣\(\)\[\]]/g, ' ');
With this:
"Portland (Oregon) USA" becomes "portland oregon usa"
"Borivali (West), Mumbai, India" becomes "borivali west mumbai india"
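A runnable sketch of this space-replacement approach. The lowercasing and comma handling below are simplified stand-ins for whatever normalization the real tokenizer performs:

```javascript
// Sketch of the proposed fix: replace marker characters with spaces
// rather than truncating, so the trailing tokens survive.
function stripDisambiguationMarkers(input) {
  return input
    .replace(/[-֊־‐‑﹣()\[\]]/g, ' ') // markers become spaces, text survives
    .toLowerCase()
    .replace(/[,\s]+/g, ' ')         // collapse commas and runs of whitespace
    .trim();
}

console.log(stripDisambiguationMarkers('Portland (Oregon) USA'));
// → 'portland oregon usa'
console.log(stripDisambiguationMarkers('Borivali (West), Mumbai, India'));
// → 'borivali west mumbai india'
```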
Referring to this:
d7de4c9#diff-b1c9f1b1a4d867ea6fd37744bd1b38e5