The smart-match module contains functions for calculating the similarity of strings and sets.

- similarity: A value in the range [0, 1] that represents how similar the two strings are. The larger the value, the more similar the two strings are.
- dissimilarity: A value in the range [0, 1] that represents how dissimilar the two strings are. The larger the value, the more dissimilar the two strings are. For a pair of strings, similarity = 1 - dissimilarity.
- distance: How far apart the two strings are. Note that not all methods support distance.
- score: The larger the score, the more similar the two strings are. Note that not all methods support score.
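
To make the relationship between distance, dissimilarity, and similarity concrete, here is a minimal, self-contained sketch of Levenshtein in plain Python. It is not the smart-match implementation. The function names are hypothetical, and dividing the distance by the length of the longer string is an assumption, chosen because it lands in [0, 1].

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def levenshtein_dissimilarity(a: str, b: str) -> float:
    # Assumption: normalize by the longer string so the value lies in [0, 1].
    longest = max(len(a), len(b))
    return levenshtein_distance(a, b) / longest if longest else 0.0


def levenshtein_similarity(a: str, b: str) -> float:
    # By definition, similarity = 1 - dissimilarity.
    return 1.0 - levenshtein_dissimilarity(a, b)


print(levenshtein_distance("hello", "hero"))       # 2
print(levenshtein_dissimilarity("hello", "hero"))  # 0.4
print(levenshtein_similarity("hello", "hero"))     # 0.6
```

The distance is unbounded (it grows with string length), while the two normalized values always sum to 1.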

We support three levels of string matching:

- char: Similarity computation based on characters in the strings.
- term: Similarity computation based on terms in the strings.
- gram: Similarity computation based on q-grams in the strings.

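
The three levels differ only in how a string is split into units before comparison. The sketch below illustrates the idea in plain Python with a simple set-based Jaccard measure. The helper names are hypothetical, and details such as whitespace tokenization, q = 2, and the absence of padding are assumptions, not a description of how smart-match tokenizes internally.

```python
def char_units(text: str) -> list[str]:
    """char level: every character is a unit."""
    return list(text)


def term_units(text: str) -> list[str]:
    """term level: whitespace-separated words are the units (assumed tokenizer)."""
    return text.split()


def gram_units(text: str, q: int = 2) -> list[str]:
    """gram level: overlapping substrings of length q, without padding (assumed)."""
    return [text[i:i + q] for i in range(len(text) - q + 1)]


def jaccard(units_a: list[str], units_b: list[str]) -> float:
    """Size of the intersection over size of the union of the two unit sets."""
    set_a, set_b = set(units_a), set(units_b)
    if not set_a and not set_b:
        return 1.0  # two empty inputs are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


a, b = "new york city", "new york"
for name, split in [("char", char_units), ("term", term_units), ("gram", gram_units)]:
    print(name, round(jaccard(split(a), split(b)), 3))
```

The same pair of strings yields a different value at each level, so the level should match the kind of variation you expect (typos, word order, or local character patterns).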
We support the following methods.
Abbreviation | Full name | similarity | dissimilarity | distance | score |
---|---|---|---|---|---|
LE(Default) | Levenshtein | ✅ | ✅ | ✅ | ❌ |
ED | EuclideanDistance | ✅ | ✅ | ✅ | ❌ |
DL | Damerau Levenshtein | ✅ | ✅ | ✅ | ❌ |
BD | Block Distance | ✅ | ✅ | ✅ | ❌ |
cos | Cosine Similarity | ✅ | ✅ | ❌ | ❌ |
TC | TanimotoCoefficient | ✅ | ✅ | ❌ | ❌ |
dice | Dice Similarity | ✅ | ✅ | ❌ | ❌ |
simon | SimonWhite | ✅ | ✅ | ❌ | ❌ |
LCST | LongestCommonSubstring | ✅ | ✅ | ✅ | ✅ |
LCSQ | LongestCommonSubSequence | ✅ | ✅ | ✅ | ✅ |
OC | OverlapCoefficient | ✅ | ✅ | ❌ | ❌ |
GOC | GeneralizedOverlapCoefficient | ✅ | ✅ | ❌ | ❌ |
jac | Jaccard | ✅ | ✅ | ❌ | ❌ |
gjac | GeneralizedJaccard | ✅ | ✅ | ❌ | ❌ |
HD | HammingDistance | ✅ | ✅ | ✅ | ❌ |
jaro | Jaro | ✅ | ✅ | ❌ | ❌ |
JW | JaroWinkler | ✅ | ✅ | ❌ | ❌ |
NW | NeedlemanWunch | ✅ | ✅ | ❌ | ✅ |
SW | SmithWaterman | ✅ | ✅ | ❌ | ✅ |
SWG | SmithWatermanGotoh | ✅ | ✅ | ❌ | ✅ |
MK | MongeElkan | ✅ | ✅ | ❌ | ❌ |
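
Several methods in the table are set-based coefficients, which is why they offer similarity and dissimilarity but no distance. As a rough illustration, the following plain-Python sketch computes the textbook forms of the Dice, overlap, and cosine coefficients on term sets. The function names are hypothetical, and smart-match may differ in details such as tokenization or multiset handling.

```python
import math


def dice(set_a: set, set_b: set) -> float:
    """2 * |A ∩ B| / (|A| + |B|)."""
    if not set_a and not set_b:
        return 1.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


def overlap(set_a: set, set_b: set) -> float:
    """|A ∩ B| / min(|A|, |B|); equals 1 when one set contains the other."""
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / min(len(set_a), len(set_b))


def cosine(set_a: set, set_b: set) -> float:
    """|A ∩ B| / sqrt(|A| * |B|), the cosine of binary membership vectors."""
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / math.sqrt(len(set_a) * len(set_b))


terms_a = set("new york city".split())
terms_b = set("new york".split())

print(dice(terms_a, terms_b))     # 0.8
print(overlap(terms_a, terms_b))  # 1.0
print(round(cosine(terms_a, terms_b), 4))
```

All three values lie in [0, 1], so the matching dissimilarity is simply 1 minus the coefficient.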

Install with pip:

    pip install smart-match

Basic usage:

    import smart_match
    print(smart_match.similarity('hello', 'hero'))
    print(smart_match.dissimilarity('hello', 'hero'))
    print(smart_match.distance('hello', 'hero'))

Output:

    0.6
    0.4
    2

See the Wiki for more details.

smart-match is free software. See the file LICENSE for the full text.