Cross-Language Fuzzy Matching: Arabic Document Matching returns 0 matches #77

Hagar-Usama · 2024-05-16T02:19:31Z

Hello,

Does the algorithm support fuzzy matching between two non-English strings? I'm trying to match two Arabic records, but it returns 0 matches even when the documents are identical. Is this feasible, or are there any workarounds available?

Thank you!

manishobhatia · 2024-05-16T04:51:32Z

Hi Hagar,
Not out of the box. It uses the Apache Soundex library under the hood to find similarity. Unfortunately, this open-source library works only on English.
If you know of a similar open-source library for Arabic, it is relatively easy to plug it in. I will be happy to discuss a way to contribute it to this project.

Thanks

Hagar-Usama · 2024-05-16T17:44:18Z

Hello @manishobhatia,

Thank you for your prompt response.

We can still use Soundex by employing the Transliterator from com.ibm.icu.text. This package transliterates text from one script to another. We can leverage this capability by converting every element value into its transliterated form. Because this works regardless of the original language, we don't actually need to check the language first.

Whether the matching elements are both non-English or one is English and the other is non-English, Soundex will still work. I propose extending the element class to include another attribute for the transliterated value, which will be used for matching.

It's worth mentioning that the transliterated value for an English word should remain the same (since the conversion is from English to English).
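To illustrate the idea, here is a minimal sketch of "transliterate first, then Soundex". It is not the library's code: the tiny Arabic-to-Latin table is a hypothetical stand-in for what com.ibm.icu.text.Transliterator would produce, the Soundex routine is a simplified re-implementation rather than the Apache one, and the class and method names are made up for this example.

```java
import java.util.Map;

// Sketch only: a toy transliteration table plus a simplified Soundex.
final class TransliteratedSoundexSketch {

    // Hypothetical and deliberately incomplete Arabic-to-Latin mapping.
    // A real implementation would delegate to a transliteration library.
    private static final Map<Character, String> ARABIC_TO_LATIN = Map.of(
            '\u0633', "s",   // seen
            '\u0627', "a",   // alef
            '\u0644', "l",   // lam
            '\u0645', "m",   // meem
            '\u0631', "r",   // reh
            '\u0646', "n",   // noon
            '\u062F', "d",   // dal
            '\u0628', "b");  // beh

    // Soundex digit for each letter A..Z ('0' means the letter is not coded).
    private static final String CODES = "01230120022455012623010202";

    // Characters without a table entry (including Latin ones) pass through unchanged.
    static String transliterate(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            out.append(ARABIC_TO_LATIN.getOrDefault(c, String.valueOf(c)));
        }
        return out.toString();
    }

    // Simplified American Soundex. Anything outside A..Z is skipped, so a string
    // with no Latin letters yields an empty code (an assumption of this sketch).
    static String soundex(String s) {
        StringBuilder out = new StringBuilder();
        char lastCode = '0';
        for (char raw : s.toCharArray()) {
            char c = Character.toUpperCase(raw);
            if (c < 'A' || c > 'Z') {
                continue;
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c);          // keep the first letter as-is
                lastCode = code;
            } else if (c != 'H' && c != 'W') {
                // H and W do not separate letters that share a code.
                if (code != '0' && code != lastCode) {
                    out.append(code);
                }
                lastCode = code;
            }
            if (out.length() == 4) {
                break;
            }
        }
        if (out.length() == 0) {
            return "";
        }
        while (out.length() < 4) {
            out.append('0');            // pad short codes with zeros
        }
        return out.toString();
    }

    // The value that would be used for matching instead of the raw text.
    static String matchKey(String s) {
        return soundex(transliterate(s));
    }

    public static void main(String[] args) {
        String arabic = "\u0633\u0627\u0644\u0645"; // seen, alef, lam, meem
        System.out.println("raw Arabic code: '" + soundex(arabic) + "'");
        System.out.println("transliterated:  " + matchKey(arabic));
        System.out.println("English 'Salem': " + matchKey("Salem"));
    }
}
```

In this sketch the raw Arabic string produces no usable code, while its transliterated form and the English spelling "Salem" both encode to S450. Note that Arabic is usually written without short vowels, so some names may still encode differently from their common English spellings after transliteration.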

I'd be happy to give this a try and submit a pull request if the concept is approved.

Thanks!

manishobhatia · 2024-05-22T22:29:33Z

Hi Hagar,

I like the idea, and I'm open to looking at your pull request and getting it into the library.

If the idea is to compare one element in English to another in a non-English language, I think we can extend the pre-processing logic of the element class and apply the transliteration there.
Look at this method, which applies a given pre-processor.

This pre-processing function can be applied externally using the setter in the builder class (see here).

If this works for you, we can include this as one of the many built-in functions provided here.
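As a rough sketch of what plugging in such a function could look like: the SketchElement class and its builder below are hypothetical stand-ins written only for this example, not the library's actual element class, and the setter name is an assumption. The transliteration step is a placeholder that handles just four Arabic letters.

```java
import java.util.Locale;
import java.util.function.Function;

// Sketch only: shows a pre-processing function supplied from outside via a builder.
final class PreProcessingSketch {

    // Hypothetical stand-in for an element that carries its own pre-processor.
    static final class SketchElement {
        private final String value;
        private final Function<String, String> preProcessor;

        private SketchElement(String value, Function<String, String> preProcessor) {
            this.value = value;
            this.preProcessor = preProcessor;
        }

        // The value that matching would see after pre-processing.
        String preProcessedValue() {
            return preProcessor.apply(value);
        }

        static final class Builder {
            private String value = "";
            private Function<String, String> preProcessor = Function.identity();

            Builder setValue(String value) {
                this.value = value;
                return this;
            }

            // Assumed setter name; lets callers inject their own function.
            Builder setPreProcessingFunction(Function<String, String> f) {
                this.preProcessor = f;
                return this;
            }

            SketchElement build() {
                return new SketchElement(value, preProcessor);
            }
        }
    }

    // Placeholder transliteration covering four letters (seen, alef, lam, meem).
    // A real function would call a transliteration library instead.
    static Function<String, String> toLatin() {
        return s -> s.replace('\u0633', 's')
                     .replace('\u0627', 'a')
                     .replace('\u0644', 'l')
                     .replace('\u0645', 'm');
    }

    // Transliterate first, then apply the usual clean-up steps.
    static Function<String, String> crossLanguageName() {
        return toLatin()
                .andThen(String::trim)
                .andThen(s -> s.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        SketchElement arabic = new SketchElement.Builder()
                .setValue(" \u0633\u0627\u0644\u0645 ")
                .setPreProcessingFunction(crossLanguageName())
                .build();
        SketchElement latin = new SketchElement.Builder()
                .setValue("Salm")
                .setPreProcessingFunction(crossLanguageName())
                .build();
        System.out.println(arabic.preProcessedValue().equals(latin.preProcessedValue()));
    }
}
```

Keeping the transliteration in a plain Function means it can be composed with other clean-up steps and supplied per element, without changing the matching code itself.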

Thanks

Hagar-Usama added a commit to Hagar-Usama/fuzzy-matcher that referenced this issue Jun 1, 2024

intuit#77 add cross language feature in NAME with unit tests

f4d5091
