"crossing" originally started as a software project by Dennis Ulmer and Sebastian Spaar during the summer semester 2014 at Heidelberg University, Germany.
In theory, crossing tries to create a transformation matrix from a vector space model of language A to a vector space model of language B, using a provided dictionary (for instance, German-English).
Then, given an unknown vector v in language A (a word not found in the dictionary), crossing can transform that vector into language B using the calculated transformation matrix and look for the most similar vector in language B.
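The idea of learning such a matrix can be illustrated with a minimal sketch. It is not crossing's own code: it uses a plain least-squares fit from NumPy on random toy data, and the names fit_transformation, true_M, X, Y and v are hypothetical. It covers only the fitting and transformation steps.

```python
import numpy as np

def fit_transformation(source_vectors, target_vectors):
    """Least-squares fit of a matrix M such that M @ v ≈ w for each paired row."""
    X = np.asarray(source_vectors, dtype=float)  # shape (n_pairs, dim_a)
    Y = np.asarray(target_vectors, dtype=float)  # shape (n_pairs, dim_b)
    # Solve X @ B ≈ Y in the least-squares sense; B has shape (dim_a, dim_b).
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return B.T  # transpose so that M maps a language-A vector to language B

# Toy data standing in for dictionary pairs (assumption: purely synthetic).
rng = np.random.default_rng(0)
true_M = rng.normal(size=(3, 4))   # hidden map from a 4-d space to a 3-d space
X = rng.normal(size=(20, 4))       # 20 "dictionary" words in language A
Y = X @ true_M.T                   # their counterparts in language B

M = fit_transformation(X, Y)

# A word that is not in the dictionary is mapped with the learned matrix.
v = rng.normal(size=4)
projected = M @ v
```

With noise-free toy data the fit recovers the hidden map; real word vectors are not related by an exact linear map, so the result there is only an approximation.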
During the software project, crossing was used to analyze anglicisms found in the German language and to check whether an anglicism's meaning has changed compared to the original English word (hence "CrOssinG": CompaRing Of AngliciSmS IN German).
Vector space models were created by running word2vec on English and German Wikipedia dumps, which were converted to plain text beforehand using a slightly altered version of WikiExtractor.py.
These tools can be found in the opt/ directory.
Many thanks to http://www.dict.cc, which provided us with a German-English dictionary.
crossing requires the following Python packages:
- NumPy
- SciPy
- scikit-learn
- nose (a requirement of scikit-learn, sometimes needed for installation)
- BeautifulSoup (for the scripts found in bin/)
Use pip install -r requirements.txt to install crossing and its requirements.
Using a virtual environment is recommended so that a small software project does not clutter your system packages.
The easiest way to learn how to use crossing is to try it interactively in a Python interpreter. Make sure crossing and its dependencies are installed, then open a Python interpreter and import it:
>>> import crossing
There is some example data prepared in the share/ directory:
share
├── de.txt
├── de_dummy.txt
├── de_vectors.txt
├── dict.txt
├── dict_dummy.txt
├── en.txt
├── en_dummy.txt
└── en_vectors.txt
Of these files, de_vectors.txt, en_vectors.txt and dict.txt are of particular interest. They are based on the corpus "Town Musicians of Bremen" found in de.txt/en.txt. Let's create a VectorTransformator object that will hold several vector transformation matrices:
>>> vt = crossing.VectorManager.VectorTransformator()
We have to fill our vt object with some language data. vt has three attributes that need to be set: vt.V and vt.W represent two vector spaces, and vt.Dictionary contains the translations of the words found in vt.V into vt.W.
For this example, use the data found in the share/ directory and load it into vt using the functions of FileManager.py:
>>> vt.Dictionary = crossing.FileManager.readDictionaryFile("share/dict.txt")
>>> vt.V = crossing.FileManager.readWord2VecFile("share/de_vectors.txt")
>>> vt.W = crossing.FileManager.readWord2VecFile("share/en_vectors.txt")
(Since we are working with word2vec data, FileManager.readWord2VecFile() is used.
However, you could assign any dictionary in the following format to vt.V/W:)
{"word": [1.0, 2.0, 3.0, ...], "another": [0.1, 0.2, 0.3, ...], ...}
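To make that format concrete, here is a minimal sketch of turning word2vec's plain-text output into such a dictionary. It is not the implementation of crossing's FileManager; the function name read_word2vec_text is hypothetical, and the sketch assumes the usual word2vec text layout (a header line with vocabulary size and dimension, then one word followed by its values per line).

```python
import io

def read_word2vec_text(handle):
    """Parse word2vec text output into a {word: [float, ...]} dictionary."""
    vectors = {}
    header = handle.readline().split()   # "<vocabulary size> <dimension>"
    expected_dim = int(header[1])
    for line in handle:
        parts = line.split()             # also drops trailing whitespace
        if not parts:
            continue                     # skip empty lines
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != expected_dim:
            raise ValueError("unexpected vector length for %r" % word)
        vectors[word] = values
    return vectors

# In-memory stand-in for a vectors file (assumption: made-up values).
sample = io.StringIO("2 3\nkatze 0.1 0.2 0.3\nhund 0.4 0.5 0.6\n")
vectors = read_word2vec_text(sample)
```

A dictionary built this way has the shape shown above and could be assigned to vt.V or vt.W.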
Remember that VectorTransformator only wraps several transformation matrices.
This way you can create different transformation models and compare their accuracies. Let's create a transformation matrix now. By default, sklearn.linear_model.Lasso with alpha = 0.1 is used (refer to the docstring to see other models):
>>> vt.createTransformationMatrix()
Let's have a look at the word katze (German for "cat"). The vector for katze in the German space and the vector for cat in the English space are, respectively:
>>> vt.V["katze"]
[0.006136, -0.052587, 0.012688, -0.01403, -0.046991, 0.042845, -0.023529, -0.001199, 0.034139, -0.003296]
>>> vt.W["cat"]
[-0.067114, 0.033746, 0.020565, 0.032246, 0.113999, 0.016741, -0.021005, 0.043264, 0.060346, -0.008794]
We can now see how crossing would transform the vector for katze into the English vector space, using the transformation matrix that was just created:
>>> vt * "katze"
(matrix([[-0.01070324],
[-0.00699281],
[ 0.00408598],
[ 0.00868466],
[ 0.03515451],
[-0.00209241],
[-0.02295664],
[ 0.01283001],
[ 0.01598752],
[-0.00638645]]),)
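To judge such a result, the transformed vector has to be compared with the vectors of the target space. The following sketch ranks words by cosine similarity. It is not part of crossing: nearest_words and the toy data are hypothetical, and cosine similarity is only one possible choice of measure. Since the output above appears as a one-element tuple, its first element would be the value to pass in.

```python
import numpy as np

def nearest_words(vector, space, k=1):
    """Return the k words in `space` whose vectors are most similar to `vector`."""
    query = np.asarray(vector, dtype=float).ravel()  # accepts column matrices too
    scores = []
    for word, values in space.items():
        candidate = np.asarray(values, dtype=float)
        denom = np.linalg.norm(query) * np.linalg.norm(candidate)
        # Cosine similarity; zero vectors are treated as having similarity 0.
        similarity = float(query @ candidate / denom) if denom else 0.0
        scores.append((similarity, word))
    scores.sort(reverse=True)
    return [(word, similarity) for similarity, word in scores[:k]]

# Toy target space and a transformed column vector (assumption: made-up values).
toy_space = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "kitten": [0.9, 0.1]}
column = np.array([[0.8], [0.1]])
best = nearest_words(column, toy_space, k=2)
```

With the real data, a call such as nearest_words((vt * "katze")[0], vt.W, k=5) would list the English words closest to the transformed vector, assuming vt.W is a plain dictionary as described earlier.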
Most of the time, when using vector information from word2vec together with the models in sklearn.linear_model, our algorithm fails miserably to create an adequate transformation matrix. One reason might be that the information provided by word2vec is not useful for creating a vector space model of a language, since word2vec is more of a straightforward approach to representing words by numerical values.
Using dummy data, like the _dummy files found in share/, creating transformation matrices works fine.