Cross-Language Fuzzy Matching: Arabic Document Matching returns 0 matches #77

Open
Hagar-Usama opened this issue May 16, 2024 · 3 comments

Comments

@Hagar-Usama

Hello,

Does the algorithm support fuzzy matching between two non-English strings? I'm trying to match two Arabic records, but it returns 0 matches even when the documents are identical. Is this feasible, or are there any workarounds available?

Thank you!

@manishobhatia
Contributor

Hi Hagar,
Not out of the box. It uses the Apache Soundex library under the hood to find similarity. Unfortunately, this open source library works only on English.
If you know of a similar open source library for Arabic, it is relatively easy to plug it in. I'd be happy to discuss a way to contribute this to the project.
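To illustrate why a Soundex-style encoder yields nothing usable for Arabic input, here is a minimal sketch of a simplified American Soundex in plain Java. It is a stand-in written for this thread, not the Apache implementation; the class name `SoundexSketch` is hypothetical, and the way Apache's class handles non-Latin input may differ (this sketch simply drops it).

```java
import java.util.Locale;

final class SoundexSketch {
    // Digit codes for A..Z; '0' marks letters that produce no digit (vowels, H, W, Y).
    private static final String CODES = "01230120022455012623010202";

    static String encode(String input) {
        StringBuilder out = new StringBuilder();
        char lastCode = '0';
        for (char raw : input.toUpperCase(Locale.ROOT).toCharArray()) {
            if (raw < 'A' || raw > 'Z') {
                continue; // only A-Z is mapped, so Arabic letters are dropped here
            }
            char code = CODES.charAt(raw - 'A');
            if (out.length() == 0) {
                out.append(raw); // the first letter is kept as-is
                lastCode = code;
                continue;
            }
            if (raw == 'H' || raw == 'W') {
                continue; // H and W do not separate repeated codes
            }
            if (code != '0' && code != lastCode) {
                out.append(code);
            }
            lastCode = code;
            if (out.length() == 4) {
                break;
            }
        }
        if (out.length() == 0) {
            return ""; // no encodable letters at all
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(SoundexSketch.encode("Robert"));
        System.out.println(SoundexSketch.encode("Rupert"));
        // The Arabic name "Muhammad", written with Unicode escapes
        System.out.println(SoundexSketch.encode("\u0645\u062D\u0645\u062F").isEmpty());
    }
}
```

In this sketch, two records written in Arabic script both reduce to an empty code, so there is nothing meaningful left to compare.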

Thanks

@Hagar-Usama
Author

Hello @manishobhatia ,

Thank you for your prompt response.

We can still use Soundex by employing the Transliterator from com.ibm.icu.text. This package transliterates text from one script to another. We can leverage this capability by converting every element value into its transliterated form, regardless of the original language, so we don't actually need to check which language it is.

Whether the matching elements are both non-English or one is English and the other is non-English, Soundex will still work. I propose extending the element class to include another attribute for the transliterated value, which will be used for matching.

It's worth mentioning that the transliterated value for an English word should remain the same (since the conversion is from English to English).
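A rough sketch of the idea, using only the JDK so it runs without extra dependencies. The tiny lookup table and the names `TransliterationSketch` and `toLatin` are hypothetical stand-ins. With ICU4J on the classpath, the equivalent step would presumably be `Transliterator.getInstance("Any-Latin; Latin-ASCII").transliterate(value)`, whose output will differ in detail from this toy table.

```java
import java.util.Map;

final class TransliterationSketch {
    // Hypothetical, deliberately tiny Arabic-to-Latin table (assumption for illustration);
    // a real transliterator covers the whole script and many more rules.
    private static final Map<Character, String> TABLE = Map.of(
            '\u0645', "m",  // meem
            '\u062D', "h",  // hah
            '\u062F', "d",  // dal
            '\u0644', "l",  // lam
            '\u064A', "y"   // yeh
    );

    // Replaces every mapped character and leaves everything else untouched,
    // so text that is already Latin passes through unchanged.
    static String toLatin(String value) {
        StringBuilder out = new StringBuilder();
        for (char c : value.toCharArray()) {
            out.append(TABLE.getOrDefault(c, String.valueOf(c)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toLatin("Mohamed"));
        System.out.println(toLatin("\u0645\u062D\u0645\u062F"));
    }
}
```

Note that unvowelled Arabic text transliterates to consonants only (here `mhmd`), so the resulting phonetic code will not necessarily equal the code of an English spelling such as `Mohamed`. That is worth testing before relying on cross-language matches.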

I'd be happy to give this a try and submit a pull request if the concept is approved.

Thanks!

@manishobhatia
Contributor

Hi Hagar,

I like the idea, and I'm open to reviewing your pull request and making it part of the library.

If the idea is to compare one element in English to another in a non-English language, I think we can extend the pre-processing logic of the element class and apply the transliteration there.
Look at this method, which applies a given pre-processor.

This pre-processing function can be supplied externally using the setter in the builder class (see here).

If this works for you, we can include this as one of the many built-in functions provided here.
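As a sketch of what a pluggable pre-processor looks like in general, the snippet below composes two `java.util.function.Function` steps and hands them to a small holder class. `PreProcessorSketch` and `Field` are hypothetical names invented for this example and are not the library's classes. The accent-stripping step uses the JDK's `java.text.Normalizer` as a stand-in for a real transliteration step.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.function.Function;

final class PreProcessorSketch {
    // Decomposes accented characters and removes the combining marks.
    static final Function<String, String> TO_ASCII = s ->
            Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");

    // Trims surrounding whitespace and lower-cases independently of the default locale.
    static final Function<String, String> NORMALIZE_CASE = s ->
            s.trim().toLowerCase(Locale.ROOT);

    // Hypothetical holder: the caller decides which pre-processor is applied.
    static final class Field {
        private final String value;
        private final Function<String, String> preProcessor;

        Field(String value, Function<String, String> preProcessor) {
            this.value = value;
            this.preProcessor = preProcessor;
        }

        String preProcessedValue() {
            return preProcessor.apply(value);
        }
    }

    public static void main(String[] args) {
        Field field = new Field(" Jos\u00E9 ", TO_ASCII.andThen(NORMALIZE_CASE));
        System.out.println(field.preProcessedValue());
    }
}
```

Because the steps are plain functions, a transliteration step could be chained in with `andThen` in the same way without touching the matching logic.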

Thanks

Hagar-Usama added a commit to Hagar-Usama/fuzzy-matcher that referenced this issue Jun 1, 2024