From 5a85dded584769cc3d57731f6a1f5a34a9c1791c Mon Sep 17 00:00:00 2001 From: Konrad Rieck Date: Wed, 22 Oct 2014 11:35:58 +0200 Subject: [PATCH] updated tutorial text --- examples/TUTORIAL.md | 105 +++++++++++++++++++++++++++++-------------- src/hstring.c | 2 +- 2 files changed, 73 insertions(+), 34 deletions(-) diff --git a/examples/TUTORIAL.md b/examples/TUTORIAL.md index 7b90e37..87d15be 100644 --- a/examples/TUTORIAL.md +++ b/examples/TUTORIAL.md @@ -1,45 +1,46 @@ # A Small Tutorial - This page provides a small tutorial for using Harry and its options. We - will use the following text file as input data: - [`toydata.txt`](toydata.txt). The file contains only four lines and - thus we are looking at only four different strings in this tutorial. + This document provides a small tutorial for using Harry and its options. + We will use the following text file as input data: [`data.txt`](data.txt). + The file contains only four lines and thus we are looking at only four + different strings in this tutorial. ## Computing a Similarity Matrix - Let's start by simply computing a similarity matrix for the four strings. + Let's start by simply computing a similarity matrix for the four strings. The default similarity measure implemented in Harry is the Levenshtein distance (edit distance). We thus get a 4x4 matrix of distance values, if we run: - harry toydata.txt - + harry data.txt - Harry will print the matrix to standard out (stdout), since no output file is selected. Alternatively, we can write the matrix to a file by running: - harry toydata.txt matrix.txt + harry data.txt matrix.txt ## Different Similarity Measures The Levenshtein distance is only one of many similarity measures for - strings. To see a list of similarity measures implemented by Harry, run + strings. To see a list of similarity measures supported by Harry, run the following command. 
harry -M Note that distances start with `dist`, while kernel functions and - similarity coefficients are prefixed by `kern` and `sim`, respectively. + similarity coefficients are prefixed by `kern` and `sim`, respectively. The latter two compute the similarity of two strings, that is, the - returned value increases with the similarity of the strings, whereas for - distances it decreases. Let's have some fun and compute a couple of - different similarity matrices + returned value increases with the similarity of the strings. By contrast, + the distances compute the dissimilarity of two strings and thus the + returned value decreases with the similarity of the strings. Let's have + some fun and compute a couple of different similarity measures: - harry -m dist_hamming toydata.txt - - harry -m dist_jaro toydata.txt - - harry -m kern_spectrum toydata.txt - - harry -m sim_jaccard toydata.txt - + harry -m dist_hamming data.txt - + harry -m dist_jaro data.txt - + harry -m kern_spectrum data.txt - + harry -m sim_jaccard data.txt - - Each of the similarity measures emphasizes different aspects of the + Each of the similarity measures emphasizes different aspects of the strings. Just have a look at Wikipedia to learn a little bit about how these are computed. @@ -56,7 +57,7 @@ the different distances, kernel functions and similarity coefficients simply operate on words instead of characters. Try this command: - harry -d' ' toydata.txt - + harry -d ' ' data.txt - Note how the distances differ from the first example, where you compute the Levenshtein distance for the characters and not the words. You can @@ -71,7 +72,7 @@ delimiter option of Harry, it computes the similarity of the sets of words contained in the strings. 
- harry -m sim_jaccard -d ' ' toydata.txt - + harry -m sim_jaccard -d ' ' data.txt - ## Endless Options @@ -88,7 +89,7 @@ You can then edit the configuration file, adapt it to your needs and use it later for running Harry as follows: - harry -c harry.cfg toydata.txt - + harry -c harry.cfg data.txt - Note that you can always override parameters on the command line and thus the configuration file can be used as a base setup for running Harry in @@ -106,7 +107,7 @@ man harry -## The Power of OpenMP +## Multi-Core Computing If you are running a multi-core system, Harry automatically utilizes all cores for computing the similarity measure. Obviously with only four @@ -114,13 +115,13 @@ demonstrate this feature we just replicate the content of the example file, as follows: - for i in `seq 1 1000` ; do cat toydata.txt ; done > large-toydata.txt + for i in `seq 1 1000` ; do cat data.txt ; done > large-data.txt - The resulting file contains 4000 strings and if we run Harry on it + The resulting file contains 4000 strings and if we run Harry on it the computation takes significantly longer. We use the option `-v` to display a progress bar. - harry -v large-toydata.txt matrix.txt + harry -v large-data.txt matrix.txt If you monitor the CPU usage while running this command, you can (hopefully) see how all cores are used. You can use the option `-n` to @@ -133,20 +134,20 @@ this will not happen very often, but in our example we have 3996 duplicate strings and thus this option boosts the computation time. - harry -v -g large-toydata.txt matrix.txt + harry -v -g large-data.txt matrix.txt ## Ranges and Splits - So far we have only compute full square matrices. Often however, one is + So far we have only computed full square matrices. Often, however, one is only interested in comparing one set of strings with another set of strings. Harry supports this setting using ranges that can be defined on the x-axis and y-axis of the matrix. 
For example, we can compare the first two strings in our example file, with the last two by running: - harry -x 0:2 -y 2:4 toydata.txt - + harry -x 0:2 -y 2:4 data.txt - The ranges are defined similar to Python array indices, where the first - value s defines the index of the first string and the second value defines + value defines the index of the first string and the second value defines the index after the last string. If the start or end index is omitted, the minimum or maximum value is @@ -156,7 +157,7 @@ all strings except for the last one. We can write the above command hence as follows - harry -x :-2 -y 2: toydata.txt - + harry -x :-2 -y 2: data.txt - For convenience, Harry supports another option that can be used to split the computation of a matrix into n pieces. This open comes handy if you @@ -164,13 +165,51 @@ different hosts. The following four commands each compute one split out of four splits. - harry -s 4:0 toydata.txt split0.txt - harry -s 4:1 toydata.txt split1.txt - harry -s 4:2 toydata.txt split2.txt - harry -s 4:3 toydata.txt split3.txt + harry -s 4:0 data.txt split0.txt + harry -s 4:1 data.txt split1.txt + harry -s 4:2 data.txt split2.txt + harry -s 4:3 data.txt split3.txt The matrices are split row-wise. That is, the resulting output can be simply concatenated to yield the original similarity matrix cat split?.txt > matrix.txt +## Output Formats + + Harry supports different output formats. In the previous examples you have + already seen the simple text format that can be used with many analysis + tools. We now have a look at the Matlab output format: + + harry -o matlab data.txt matrix.mat + + You can easily access the computed similarity values from the Matlab + environment by loading the file `matrix.mat`. 
+ + > data = load('matrix.mat') + data = + scalar structure containing the fields: + matrix = + 0 8 29 28 + 8 0 29 32 + 29 29 0 25 + 28 32 25 0 + + Similarly, you can use Harry to output the similarity values as a JSON + object using the following command: + + harry -o json data.txt matrix.json + + JSON has been designed for use in JavaScript; however, it is also a handy + format for programming in Python. Here is an example in Python + code. + + import json + data = json.load(open('matrix.json')) + for row in data['matrix']: + print(row) + +## Conclusions + + Have fun with Harry! + Konrad & Christian diff --git a/src/hstring.c b/src/hstring.c index 0b1c395..173c6a3 100644 --- a/src/hstring.c +++ b/src/hstring.c @@ -551,7 +551,7 @@ static void soundex(char *in, int len, char *out) /** * Perform a soundex transformation of each word. - * @param s string + * @param x string */ hstring_t hstring_soundex(hstring_t x) {