Cross-Language Fuzzy Matching: Arabic Document Matching returns 0 matches #77
Comments
Hi Hagar, Thanks
Hello @manishobhatia, Thank you for your prompt response. We can still use Soundex by employing the Transliterator from com.ibm.icu.text, which transliterates text from one script to another. We can leverage this by converting every element value into its transliterated form, regardless of the original language, so we need not check the language at all. Whether both matching elements are non-English or one is English and the other is not, Soundex will still work. I propose extending the Element class with another attribute that holds the transliterated value, which will be used for matching. Note that the transliterated value of an English word should remain unchanged, since the conversion is from English to English. I'd be happy to give this a try and submit a pull request if the concept is approved. Thanks!
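To make the proposal above concrete, here is a minimal, dependency-free sketch of "transliterate first, then Soundex". It is not the library's code. The class name, the four-letter Arabic table, and the choice to drop characters outside A-Z are all assumptions made for illustration. In a real implementation the `transliterate` step would presumably delegate to ICU, for example `Transliterator.getInstance("Any-Latin; Latin-ASCII").transliterate(value)`.

```java
import java.util.Map;

final class TransliteratedSoundexSketch {

    // Hypothetical four-letter table (seen, alef, lam, meem), standing in
    // for a real transliterator such as the one in com.ibm.icu.text.
    private static final Map<Character, String> ARABIC_TO_LATIN = Map.of(
            '\u0633', "s",
            '\u0627', "a",
            '\u0644', "l",
            '\u0645', "m");

    // Soundex digit for each letter A..Z; '0' marks letters that get no digit.
    private static final String CODES = "01230120022455012623010202";

    // Replaces every mapped Arabic letter; all other characters pass through,
    // so English input comes back unchanged.
    static String transliterate(String value) {
        StringBuilder latin = new StringBuilder();
        for (char c : value.toCharArray()) {
            latin.append(ARABIC_TO_LATIN.getOrDefault(c, String.valueOf(c)));
        }
        return latin.toString();
    }

    // Classic American Soundex. Assumption: characters outside A-Z are
    // skipped, so text with no Latin letters yields an empty code.
    static String soundex(String value) {
        StringBuilder out = new StringBuilder();
        char previous = '0';
        for (char raw : value.toCharArray()) {
            char c = Character.toUpperCase(raw);
            if (c < 'A' || c > 'Z') {
                continue;
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c);          // first letter is kept verbatim
                previous = code;
                continue;
            }
            if (c == 'H' || c == 'W') {
                continue;               // H and W do not separate equal codes
            }
            if (code != '0' && code != previous) {
                out.append(code);
            }
            previous = code;            // a vowel resets the previous code
            if (out.length() == 4) {
                break;
            }
        }
        if (out.length() == 0) {
            return "";
        }
        while (out.length() < 4) {
            out.append('0');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String arabic = "\u0633\u0627\u0644\u0645";               // the name "Salem" in Arabic script
        System.out.println(soundex(arabic).isEmpty());            // true
        System.out.println(soundex(transliterate(arabic)));       // S450
        System.out.println(soundex("Salem"));                     // S450
    }
}
```

In this sketch the raw Arabic string produces an empty code, while the transliterated form produces the same code as the English spelling. Arabic script usually leaves short vowels unwritten, so some names may still encode differently from their English spelling even after transliteration.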
Hi Hagar, I like the idea, and I'm open to reviewing your pull request and making it part of the library. If the goal is to compare an element in English with one in another language, I think we can extend the pre-processing logic of the Element class and apply the translation there. This pre-processing function can be applied externally using the setter in the builder class (see here). If this works for you, we can include it as one of the many built-in functions provided here. Thanks
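As a rough illustration of the pre-processing approach suggested above, the transliteration step can be modelled as a plain `java.util.function.Function<String, String>` and chained with other clean-up steps. The class and method names below are hypothetical. The builder setter itself is not shown, because its exact name and signature are not given in this thread.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.function.Function;

final class PreProcessingSketch {

    // Builds one pre-processing function: transliterate, strip combining
    // marks (accents, Arabic vowel signs), then trim and lower-case.
    // The transliterator is passed in, so an ICU-backed function could be
    // supplied here without changing the rest of the chain.
    static Function<String, String> transliterateThenNormalize(
            Function<String, String> transliterator) {
        Function<String, String> stripMarks =
                s -> Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
        Function<String, String> tidy =
                s -> s.trim().toLowerCase(Locale.ROOT);
        return transliterator.andThen(stripMarks).andThen(tidy);
    }

    public static void main(String[] args) {
        // Identity stands in for a real transliterator in this demo.
        Function<String, String> preProcess =
                transliterateThenNormalize(Function.identity());
        System.out.println(preProcess.apply("  S\u00e1lem "));   // salem
    }
}
```

The resulting function is what would be handed to the element builder's pre-processing setter, assuming that setter accepts a `Function<String, String>` for string elements.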
Hello,
Does the algorithm support fuzzy matching between two non-English strings? I'm trying to match two Arabic records, but it returns 0 matches even when the documents are identical. Is this feasible, or are there any workarounds available?
Thank you!