updated tutorial text
rieck committed Oct 22, 2014
1 parent df64f13 commit 5a85dde
Showing 2 changed files with 73 additions and 34 deletions.
105 changes: 72 additions & 33 deletions examples/TUTORIAL.md
@@ -1,45 +1,46 @@
# A Small Tutorial

This page provides a small tutorial for using Harry and its options. We
will use the following text file as input data:
[`toydata.txt`](toydata.txt). The file contains only four lines and
thus we are looking at only four different strings in this tutorial.
This document provides a small tutorial for using Harry and its options.
We will use the following text file as input data: [`data.txt`](data.txt).
The file contains only four lines and thus we are looking at only four
different strings in this tutorial.

## Computing a Similarity Matrix

Let's start by simply computing a similarity matrix for the four strings.
Let's start by simply computing a similarity matrix for the four strings.
The default similarity measure implemented in Harry is the Levenshtein
distance (edit distance). We thus get a 4x4 matrix of distance values if
we run:

harry toydata.txt -
harry data.txt -

Harry will print the matrix to standard out (stdout), since no output file
is selected. Alternatively, we can write the matrix to a file by running:

harry toydata.txt matrix.txt
harry data.txt matrix.txt
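
For readers unfamiliar with the Levenshtein distance, the following Python
sketch shows what such a distance matrix contains. It is a textbook
dynamic-programming version, not Harry's implementation, and the three
input strings are made up for illustration, since the contents of
`data.txt` are not reproduced here.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (textbook dynamic programming)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character of a
                            curr[j - 1] + 1,      # insert a character of b
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


# Hypothetical input lines; the real contents of data.txt are not shown here.
strings = ["kitten", "sitting", "mitten"]

# Full square matrix: entry (i, j) is the distance between strings i and j.
matrix = [[levenshtein(x, y) for y in strings] for x in strings]
for row in matrix:
    print(" ".join(str(v) for v in row))
```

The matrix is symmetric with zeros on the diagonal, because a string has
distance zero to itself.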

## Different Similarity Measures

The Levenshtein distance is only one of many similarity measures for
strings. To see a list of similarity measures implemented by Harry, run
strings. To see a list of similarity measures supported by Harry, run
the following command.

harry -M

Note that distances start with `dist`, while kernel functions and
similarity coefficients are prefixed by `kern` and `sim`, respectively.
similarity coefficients are prefixed by `kern` and `sim`, respectively.
The latter two compute the similarity of two strings, that is, the
returned value increases with the similarity of the strings, whereas for
distances it decreases. Let's have some fun and compute a couple of
different similarity matrices
returned value increases with the similarity of the strings. By contrast,
the distances compute the dissimilarity of two strings and thus the
returned value decreases with the similarity of the strings. Let's have
some fun and compute a couple of different similarity measures:

harry -m dist_hamming toydata.txt -
harry -m dist_jaro toydata.txt -
harry -m kern_spectrum toydata.txt -
harry -m sim_jaccard toydata.txt -
harry -m dist_hamming data.txt -
harry -m dist_jaro data.txt -
harry -m kern_spectrum data.txt -
harry -m sim_jaccard data.txt -

Each of the similarity measures emphasizes different aspects of the
Each of the similarity measures emphasizes different aspects of the
strings. Just have a look at Wikipedia to learn a little bit about how
these are computed.

@@ -56,7 +57,7 @@
the different distances, kernel functions and similarity coefficients
simply operate on words instead of characters. Try this command:

harry -d' ' toydata.txt -
harry -d ' ' data.txt -

Note how the distances differ from the first example, where you compute
the Levenshtein distance for the characters and not the words. You can
@@ -71,7 +72,7 @@
delimiter option of Harry, it computes the similarity of the sets of words
contained in the strings.

harry -m sim_jaccard -d ' ' toydata.txt -
harry -m sim_jaccard -d ' ' data.txt -
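
To illustrate what comparing sets of words means, here is a minimal
Python sketch of the Jaccard coefficient over word sets. The function
name `jaccard_words` and the two example sentences are assumptions made
for illustration. Harry's `sim_jaccard` may differ in details such as the
handling of empty strings.

```python
def jaccard_words(a, b, delim=" "):
    """Jaccard coefficient of the word sets of a and b: |A & B| / |A | B|."""
    set_a = {w for w in a.split(delim) if w}
    set_b = {w for w in b.split(delim) if w}
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty word sets count as identical
    return len(set_a & set_b) / len(union)


# Two hypothetical strings sharing the words "the" and "brown".
print(jaccard_words("the quick brown fox", "the lazy brown dog"))
```

The two strings share two of six distinct words, so the coefficient is
one third. Word order and repeated words do not matter, because only
set membership is compared.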

## Endless Options

@@ -88,7 +89,7 @@
You can then edit the configuration file, adapt it to your needs and use
it later for running Harry as follows:

harry -c harry.cfg toydata.txt -
harry -c harry.cfg data.txt -

Note that you can always override parameters on the command line and thus
the configuration file can be used as a base setup for running Harry in
@@ -106,21 +107,21 @@

man harry

## The Power of OpenMP
## Multi-Core Computing

If you are running a multi-core system, Harry automatically utilizes all
cores for computing the similarity measure. Obviously with only four
strings in our example data, this feature is not necessary. To
demonstrate this feature we just replicate the content of the example
file, as follows:

for i in `seq 1 1000` ; do cat toydata.txt ; done > large-toydata.txt
for i in `seq 1 1000` ; do cat data.txt ; done > large-data.txt

The resulting file contains 4000 strings and if we run Harry on it
The resulting file contains 4000 strings and if we run Harry on it
the computation takes significantly longer. We use the option `-v`
to display a progress bar.

harry -v large-toydata.txt matrix.txt
harry -v large-data.txt matrix.txt

If you monitor the CPU usage while running this command, you can
(hopefully) see how all cores are used. You can use the option `-n` to
@@ -133,20 +134,20 @@
this will not happen very often, but in our example we have 3996 duplicate
strings and thus this option considerably reduces the computation time.

harry -v -g large-toydata.txt matrix.txt
harry -v -g large-data.txt matrix.txt

## Ranges and Splits

So far we have only compute full square matrices. Often however, one is
So far we have only computed full square matrices. Often however, one is
only interested in comparing one set of strings with another set of
strings. Harry supports this setting using ranges that can be defined on
the x-axis and y-axis of the matrix. For example, we can compare the
first two strings in our example file, with the last two by running:

harry -x 0:2 -y 2:4 toydata.txt -
harry -x 0:2 -y 2:4 data.txt -

The ranges are defined similar to Python array indices, where the first
value s defines the index of the first string and the second value defines
value defines the index of the first string and the second value defines
the index after the last string.

If the start or end index is omitted, the minimum or maximum value is
@@ -156,21 +157,59 @@
all strings except for the last one. We can hence write the above command
as follows:

harry -x :-2 -y 2: toydata.txt -
harry -x :-2 -y 2: data.txt -
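
Since the ranges follow Python's slicing convention, the equivalence of
the two commands can be checked in Python itself. The helper
`parse_range` below is hypothetical and only mimics the semantics
described here. How Harry treats malformed or out-of-bounds ranges is not
covered.

```python
def parse_range(spec, n):
    """Resolve a 'start:end' range against n strings, as Python slicing does."""
    start, end = spec.split(":")
    s = slice(int(start) if start else None, int(end) if end else None)
    lo, hi, _ = s.indices(n)  # fills in omitted and negative indices
    return lo, hi


n = 4  # the example file contains four strings
for spec in ["0:2", ":-2", "2:4", "2:"]:
    print(spec, "->", parse_range(spec, n))
```

With four strings, `0:2` and `:-2` resolve to the same pair of indices,
as do `2:4` and `2:`.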

For convenience, Harry supports another option that can be used to split
the computation of a matrix into n pieces. This option comes in handy if you
want to distribute the computation of a large similarity matrix over
different hosts. The following four commands each compute one split out
of four splits.

harry -s 4:0 toydata.txt split0.txt
harry -s 4:1 toydata.txt split1.txt
harry -s 4:2 toydata.txt split2.txt
harry -s 4:3 toydata.txt split3.txt
harry -s 4:0 data.txt split0.txt
harry -s 4:1 data.txt split1.txt
harry -s 4:2 data.txt split2.txt
harry -s 4:3 data.txt split3.txt

The matrices are split row-wise. That is, the resulting outputs can
simply be concatenated to yield the original similarity matrix:

cat split?.txt > matrix.txt
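
The idea behind row-wise splitting can be sketched in Python. The helper
`split_bounds` is hypothetical and assumes that each split receives a
contiguous, roughly equal block of rows. The text here only states that
the split is row-wise, so the exact assignment of rows to splits in Harry
may differ.

```python
def split_bounds(n_rows, n_splits, index):
    """Return (start, end) of the rows in split `index`.

    Assumption: contiguous blocks, with any leftover rows given to the
    first splits.
    """
    base, extra = divmod(n_rows, n_splits)
    start = index * base + min(index, extra)
    end = start + base + (1 if index < extra else 0)
    return start, end


# A small made-up 4x4 matrix standing in for the full similarity matrix.
matrix = [[i * 4 + j for j in range(4)] for i in range(4)]

# Compute each split separately, then concatenate the rows again.
splits = []
for index in range(4):
    start, end = split_bounds(len(matrix), 4, index)
    splits.append(matrix[start:end])

rebuilt = [row for part in splits for row in part]
print(rebuilt == matrix)
```

Because every row lands in exactly one split and the splits are ordered,
concatenating them restores the full matrix.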

## Output Formats

Harry supports different output formats. In the previous examples you have
already seen the simple text format that can be used with many analysis
tools. We now have a look at the Matlab output format:

harry -o matlab data.txt matrix.mat

You can easily access the computed similarity values from the Matlab
environment by loading the file `matrix.mat`.

> data = load('matrix.mat')
data =
scalar structure containing the fields:
matrix =
0 8 29 28
8 0 29 32
29 29 0 25
28 32 25 0

Similarly, you can use Harry to output the similarity values as a JSON
object using the following command:

harry -o json data.txt matrix.json

JSON has been designed for use in JavaScript; however, it is also a handy
format for programming in Python. Here is an example in Python:

import json

# Load the matrix written by `harry -o json` and print it row by row
with open('matrix.json') as f:
    data = json.load(f)
for row in data['matrix']:
    print(row)

## Conclusions

Have fun with Harry!
Konrad & Christian
2 changes: 1 addition & 1 deletion src/hstring.c
@@ -551,7 +551,7 @@ static void soundex(char *in, int len, char *out)

/**
* Perform a soundex transformation of each word.
* @param s string
* @param x string
*/
hstring_t hstring_soundex(hstring_t x)
{