The smart-match module provides functions for computing the similarity of strings and sets.
- similarity: A value in the range [0, 1] that represents how similar two strings are. The larger the value, the more similar the strings.
- dissimilarity: A value in the range [0, 1] that represents how dissimilar two strings are. The larger the value, the more dissimilar the strings. For any pair of strings, similarity = 1 - dissimilarity.
- distance: How far apart two strings are. Not all methods support distance.
- score: The larger the score, the more similar the two strings are. Not all methods support score.
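To make the relationship between distance, dissimilarity, and similarity concrete, here is a minimal sketch for the Levenshtein case. It does not use smart-match itself. The function names are hypothetical, and dividing the distance by the length of the longer string is an assumption about the normalization, not something this README confirms.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def levenshtein_dissimilarity(a: str, b: str) -> float:
    """Distance scaled into [0, 1]; ASSUMPTION: normalized by the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # two empty strings are treated as identical
    return levenshtein_distance(a, b) / longest


def levenshtein_similarity(a: str, b: str) -> float:
    """similarity = 1 - dissimilarity, as defined in the list above."""
    return 1.0 - levenshtein_dissimilarity(a, b)


if __name__ == "__main__":
    print(levenshtein_distance("hello", "hero"))
    print(levenshtein_dissimilarity("hello", "hero"))
    print(levenshtein_similarity("hello", "hero"))
```

Under this assumed normalization, the distance is an unbounded edit count while the other two values always stay within [0, 1].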
We support three levels of string matching.
- char: Similarity computation based on characters in the strings.
- term: Similarity computation based on terms in the strings.
- gram: Similarity computation based on q-grams in the strings.
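The three levels differ only in how a string is split into units before comparison. The sketch below illustrates one plausible reading. The helper names are hypothetical, and splitting terms on whitespace, using q = 2, and not padding the string are all assumptions rather than documented smart-match behavior.

```python
from typing import List


def char_units(text: str) -> List[str]:
    """char level: every character is one unit."""
    return list(text)


def term_units(text: str) -> List[str]:
    """term level: ASSUMPTION that terms are whitespace-separated tokens."""
    return text.split()


def gram_units(text: str, q: int = 2) -> List[str]:
    """gram level: overlapping substrings of length q (no padding assumed)."""
    if q <= 0:
        raise ValueError("q must be positive")
    # A string shorter than q yields no q-grams.
    return [text[i:i + q] for i in range(len(text) - q + 1)]


if __name__ == "__main__":
    sample = "new york"
    print(char_units(sample))
    print(term_units(sample))
    print(gram_units(sample, q=2))
```

The same similarity method can give quite different results depending on the level, because two strings may share most of their characters while sharing few whole terms.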
We support the following methods.
Method | similarity | dissimilarity | distance | score |
---|---|---|---|---|
Levenshtein (default) | ✅ | ✅ | ✅ | ❌ |
Euclidean | ✅ | ✅ | ✅ | ❌ |
Damerau Levenshtein | ✅ | ✅ | ✅ | ❌ |
Block Distance | ✅ | ✅ | ✅ | ❌ |
Cosine | ✅ | ✅ | ❌ | ❌ |
Tanimoto Coefficient | ✅ | ✅ | ❌ | ❌ |
Dice | ✅ | ✅ | ❌ | ❌ |
Simon White | ✅ | ✅ | ❌ | ❌ |
Longest Common Substring | ✅ | ✅ | ✅ | ✅ |
Longest Common Subsequence | ✅ | ✅ | ✅ | ✅ |
Overlap Coefficient | ✅ | ✅ | ❌ | ❌ |
Generalized Overlap Coefficient | ✅ | ✅ | ❌ | ❌ |
Jaccard | ✅ | ✅ | ❌ | ❌ |
Generalized Jaccard | ✅ | ✅ | ❌ | ❌ |
Hamming | ✅ | ✅ | ✅ | ❌ |
Jaro | ✅ | ✅ | ❌ | ❌ |
Jaro Winkler | ✅ | ✅ | ❌ | ❌ |
Needleman Wunsch | ✅ | ✅ | ❌ | ✅ |
Smith Waterman | ✅ | ✅ | ❌ | ✅ |
Smith Waterman Gotoh | ✅ | ✅ | ❌ | ✅ |
Monge Elkan | ✅ | ✅ | ❌ | ❌ |
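Several of the methods in the table (Jaccard, Dice, Overlap Coefficient) are set-overlap ratios, which is one reason they yield a similarity in [0, 1] but no natural distance. The sketch below uses the textbook set definitions. It is not the smart-match implementation, the function names are hypothetical, and the handling of empty sets is an assumption.

```python
from typing import Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """|A ∩ B| / |A ∪ B|"""
    if not a and not b:
        return 1.0  # ASSUMPTION: two empty sets count as identical
    return len(a & b) / len(a | b)


def dice(a: Set[str], b: Set[str]) -> float:
    """2 * |A ∩ B| / (|A| + |B|)"""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def overlap_coefficient(a: Set[str], b: Set[str]) -> float:
    """|A ∩ B| / min(|A|, |B|)"""
    if not a and not b:
        return 1.0
    if not a or not b:
        return 0.0  # ASSUMPTION: an empty set shares nothing with a non-empty one
    return len(a & b) / min(len(a), len(b))


if __name__ == "__main__":
    # Bigram sets for "hello" and "hero", written out by hand.
    left = {"he", "el", "ll", "lo"}
    right = {"he", "er", "ro"}
    print(jaccard(left, right))
    print(dice(left, right))
    print(overlap_coefficient(left, right))
```

All three share the same numerator and differ only in the denominator, so for any pair of non-empty sets the overlap coefficient is at least as large as Dice, which is at least as large as Jaccard.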
pip install smart-match
import smart_match
print(smart_match.similarity('hello', 'hero'))
print(smart_match.dissimilarity('hello', 'hero'))
print(smart_match.distance('hello', 'hero'))
Output:
0.6
0.4
2
See the Wiki for more details.
smart-match is free software. See the LICENSE file for the full text.