There are some false positives when inputting gibberish: Lingua identifies it as a language when it should return None.
Examples:
vszzc hvwg wg zcbu hslh
5HeQsKSTseGZrDvdCAUYr6DyxS5jy4953UWACh9bN2rUFkj2sDuY3BS
VGhpcyBpcyBhbiBleGFtcGxlIG9mIGJhc2U2NA==
KZDWQ4DDPFBHAY3ZIJUGE2KCNRSUORTUMNDXQ3CJI44W2SKHJJUGGMSVGJHECPJ5
The project I'm working on has a lot of gibberish. We need to distinguish between different languages and gibberish. I've been looking for solutions, but I'm not an expert at NLP.
I'd like your opinion on what the best solution for that use case would be.
Hi @SkeletalDemise, thank you for reaching out to me. Currently, Lingua is not able to identify gibberish. It sums up probabilities for letter sequences (= ngrams) learned from training data for each supported language. Even the ngrams in gibberish have a certain probability and Lingua simply returns the language with the highest probability. So it's not that easy to identify gibberish. But I will think about how to solve this as it is a pretty interesting problem.
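To illustrate the behaviour described above, here is a minimal toy sketch. It is not Lingua's actual implementation: the two tiny corpora, the trigram-only model, the add-one smoothing, and the names `train`, `log_score`, `detect`, and `MODELS` are all assumptions made for illustration. It only shows why a detector that picks the highest-scoring language always returns some language, even for gibberish.

```python
import math
from collections import Counter


def trigrams(text):
    """Return all overlapping 3-character sequences of the lowercased text."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]


def train(corpus):
    """Count trigrams in a corpus; returns (counts, total number of trigrams)."""
    counts = Counter(trigrams(corpus))
    return counts, sum(counts.values())


def log_score(text, model, vocab_size=10_000):
    """Sum of log-probabilities of the text's trigrams under one language model.

    Add-one smoothing (an assumption of this sketch) gives unseen trigrams a
    small but non-zero probability, so even gibberish gets a finite score.
    """
    counts, total = model
    return sum(
        math.log((counts[g] + 1) / (total + vocab_size)) for g in trigrams(text)
    )


# Tiny illustrative corpora (hypothetical; real models use far more data).
MODELS = {
    "english": train(
        "this is an example of english text "
        "the quick brown fox jumps over the lazy dog"
    ),
    "german": train(
        "dies ist ein beispiel mit deutschem text "
        "der schnelle braune fuchs springt auf den faulen hund"
    ),
}


def detect(text):
    """Return the highest-scoring language and all scores; never returns None."""
    scores = {lang: log_score(text, model) for lang, model in MODELS.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    for sample in ["the quick brown fox", "vszzc hvwg wg zcbu hslh"]:
        best, scores = detect(sample)
        print(repr(sample), "->", best, scores)
```

In this sketch the gibberish sample still gets a "winner", because taking the maximum over the per-language scores has no "none of the above" option. The scores themselves are all very low, but their ranking alone does not reveal that.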