SFSimilarity

Similarity and Distance functions for Snowflake

This project provides several ways to calculate similarity scores and edit distances between strings.

The list of edit distances that we currently support follows:

Cosine Distance,
Hamming Distance,
Jaccard Distance,
Jaro-Winkler Distance,
Levenshtein Distance,
Longest Common Subsequence Distance,

and the list of similarity scores that we support follows:

Cosine Similarity,
Fuzzy Score Similarity,
Jaccard Similarity,
Jaro-Winkler Similarity, and
Longest Common Subsequence Similarity

Note:

The difference between a "similarity score" and a "distance function" is that a distance function meets all of the following conditions:

d(x,y) >= 0, non-negativity or separation axiom
d(x,y) == 0 if and only if x == y, identity of indiscernibles
d(x,y) == d(y,x), symmetry, and
d(x,z) <= d(x,y) + d(y,z), the triangle inequality

A "similarity score", by contrast, need not satisfy all of these properties. It is often fairly easy, though, to "normalize" a similarity score to manufacture an "edit distance."
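As a rough illustration of the note above, the sketch below derives a distance from the Longest Common Subsequence similarity (the LCS length) as len(x) + len(y) - 2 * LCS(x, y), then spot-checks the four conditions on a small sample. It is plain Java written for this README, not the code shipped in sfsimilarity-1.0.jar; the class name `LcsSketch` and its methods are assumptions made for the example, and passing on a sample does not prove the conditions in general.

```java
// Illustrative sketch only; LcsSketch and its method names are hypothetical.
final class LcsSketch {

    // Similarity score: length of the longest common subsequence,
    // computed with the classic dynamic-programming table.
    static int similarity(String x, String y) {
        int[][] t = new int[x.length() + 1][y.length() + 1];
        for (int i = 1; i <= x.length(); i++) {
            for (int j = 1; j <= y.length(); j++) {
                t[i][j] = x.charAt(i - 1) == y.charAt(j - 1)
                        ? t[i - 1][j - 1] + 1
                        : Math.max(t[i - 1][j], t[i][j - 1]);
            }
        }
        return t[x.length()][y.length()];
    }

    // Distance derived from the similarity: characters outside the LCS
    // must be deleted from x or inserted from y.
    static int distance(String x, String y) {
        return x.length() + y.length() - 2 * similarity(x, y);
    }

    // Checks the four distance conditions for every triple in the sample.
    static boolean satisfiesAxioms(String... sample) {
        for (String x : sample) {
            for (String y : sample) {
                int dxy = distance(x, y);
                if (dxy < 0) return false;                       // non-negativity
                if ((dxy == 0) != x.equals(y)) return false;     // zero only for equal strings
                if (dxy != distance(y, x)) return false;         // symmetry
                for (String z : sample) {
                    if (distance(x, z) > dxy + distance(y, z)) { // triangle inequality
                        return false;
                    }
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(similarity("snowflake", "flake")); // 5
        System.out.println(distance("snowflake", "flake"));   // 4
        System.out.println(satisfiesAxioms("", "snow", "snowflake", "flake", "flask")); // true
    }
}
```

Note that the similarity itself fails the second condition: `similarity("snow", "snow")` is 4, not 0, which is why it is called a score rather than a distance.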

Installation:

Before creating the UDFs in Snowflake, you need to upload sfsimilarity-1.0.jar, commons-lang3-3.12.0.jar, and commons-text-1.9.jar to a stage. Download the binaries from the Latest Release.

First create a stage (or use an existing one) in Snowflake:

CREATE STAGE SFSimilarity 
 COMMENT = 'Similarity and Distance functions for Snowflake';

Upload the JARs to the Snowflake stage (for example, @SFSimilarity) using SnowSQL:

put file:///Users/me/Downloads/sfsimilarity-1.0.jar @SFSimilarity/ AUTO_COMPRESS = FALSE OVERWRITE = TRUE;
put file:///Users/me/Downloads/commons-lang3-3.12.0.jar @SFSimilarity/ AUTO_COMPRESS = FALSE OVERWRITE = TRUE;
put file:///Users/me/Downloads/commons-text-1.9.jar @SFSimilarity/ AUTO_COMPRESS = FALSE OVERWRITE = TRUE;

Create the UDFs using the SQL from the source code here: https://github.com/Snowflake-Labs/SFSimilarity/blob/main/src/main/sql/SFSimilarity.sql
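For orientation, a Snowflake Java UDF maps a SQL function onto a method in a class packaged in one of the uploaded JARs. The sketch below shows what such a handler could look like for a Hamming distance. The class name `HammingHandlerSketch`, the method name, and the SQL function name in the comment are all hypothetical; the real handler names are in the SFSimilarity.sql file linked above. Passing SQL NULL through as NULL is also an assumption of this sketch, not a statement about the shipped UDFs.

```java
// Hypothetical handler sketch; the names here are NOT taken from the repository.
//
// A matching definition might look roughly like this (function name assumed):
//   CREATE FUNCTION hamming_distance(a VARCHAR, b VARCHAR) RETURNS INTEGER
//     LANGUAGE JAVA
//     IMPORTS = ('@SFSimilarity/sfsimilarity-1.0.jar')
//     HANDLER = 'HammingHandlerSketch.hammingDistance';
final class HammingHandlerSketch {

    // SQL VARCHAR arguments arrive as Java Strings.
    // Assumption: a SQL NULL input should yield a SQL NULL result.
    public static Integer hammingDistance(String left, String right) {
        if (left == null || right == null) {
            return null;
        }
        // Hamming distance is only defined for strings of equal length.
        if (left.length() != right.length()) {
            throw new IllegalArgumentException("Strings must have the same length");
        }
        int distance = 0;
        for (int i = 0; i < left.length(); i++) {
            if (left.charAt(i) != right.charAt(i)) {
                distance++;
            }
        }
        return distance;
    }

    public static void main(String[] args) {
        System.out.println(hammingDistance("karolin", "kathrin")); // 3
    }
}
```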

Examples:

Run the following examples to see how the functions work: https://github.com/Snowflake-Labs/SFSimilarity/blob/main/src/main/sql/SFSimilarity_Examples.sql

Compiling from source:

To compile from source, first clone this repository, then build the JAR for the UDFs using Maven:

mvn package

Additional Resources:

User guide for the underlying Apache Commons Text package: https://commons.apache.org/proper/commons-text/userguide.html

A good walkthrough of the algorithms with examples: https://apothem.blog/apache-commons-text.html#string-similarity
