Similarity Metrics based on Word2Vec for Concept Lists in Different Languages #10
There is this collection, which might be helpful: https://www.marekrei.com/projects/vectorsets/ |
And we don't need to bother about Spanish word2vec; we only want to see if this works or not, so we can start with English. |
So the scope here is really just to develop a few similarity metrics and demonstrate their use and usefulness, to be integrated into our grander project of incorporating a similarity measure into cognate matching and borrowing detection. Sorry, I had lost sight of this for a moment! We can start with word2vec in English (as @LinguList suggested) and consider GloVe, fastText, or other more recent vector sets if this seems promising [or even start with fastText]. An advantage of fastText is that it works at the subword level and so might be more forgiving of smaller discrepancies in word form. I previously developed a simple class function for finding the closest match and performing vector operations to emulate the logical relationships noted by Mikolov in his original papers. Although this was for similarity of IPA segments in my research project on phonology, it at least gives me a starting point. Here are references for word2vec (https://www.tensorflow.org/tutorials/text/word2vec) and fastText (https://fasttext.cc/docs/en/support.html). I like the API provided by fastText, with some functions already defined, but their 4.5GB download of English seems excessive compared to the 127MB representation of English distributed by Marek Rei (above). |
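For reference, the kind of closest-match lookup and Mikolov-style vector arithmetic mentioned above can be done with gensim; here is a minimal sketch, assuming a locally downloaded word2vec file (the path is a placeholder):

```python
from gensim.models import KeyedVectors

# Load pre-trained embeddings (binary=True for the original word2vec format).
vectors = KeyedVectors.load_word2vec_format("vectors/english_word2vec.bin", binary=True)

# Closest matches to a single word.
print(vectors.most_similar("hand", topn=5))

# Vector arithmetic in the style of Mikolov's analogies: king - man + woman ~ queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```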
Hm, I am wondering how to start. What seems best is to test an approach by which one shows, in an example:

    from pysem.fasttext import similarity
    print(similarity("mano", "dio", "Spanish"))

Of course, one may think of different "Spanish" forms, but we can discuss this later. In the long run, this may contribute to norare, but we need to start with some examples for now. |
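A minimal sketch of how such a function might be implemented internally, assuming one gensim vector file per language; the file naming scheme and the caching are assumptions, not the actual pysem API:

```python
from functools import lru_cache
from gensim.models import KeyedVectors

@lru_cache(maxsize=None)
def _load_vectors(language):
    # One pre-converted embedding file per language, e.g. "spanish.kv" (placeholder name).
    return KeyedVectors.load(f"{language.lower()}.kv")

def similarity(word_a, word_b, language="Spanish"):
    """Cosine similarity between two word forms in the given language."""
    vectors = _load_vectors(language)
    return float(vectors.similarity(word_a, word_b))

print(similarity("mano", "dio", "Spanish"))
```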
If our product is to be used by a semantic matching program with a limited vocabulary, then we could use any of the available embeddings in Spanish or Portuguese (staying with English for prototyping) to develop a resource of semantic distances between words, without retaining the embeddings themselves.
Even if we use something big like fastText, as long as we can limit our vocabulary to something like that given above, we could have an efficient similarity measure. |
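A rough sketch of that idea, precomputing pairwise similarities for a fixed vocabulary so the embeddings themselves never have to be shipped (gensim assumed; file names are placeholders):

```python
import json
from itertools import combinations
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("english_vectors.bin", binary=True)
with open("wordlist.txt") as handle:
    vocabulary = sorted({w.strip() for w in handle if w.strip() in vectors})

# Store only the pairwise similarities, not the embeddings themselves.
similarities = {
    f"{a} / {b}": round(float(vectors.similarity(a, b)), 4)
    for a, b in combinations(vocabulary, 2)
}

with open("similarities.json", "w") as handle:
    json.dump(similarities, handle)
```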
I am okay with testing several variants. We can later also zip the data and then unzip it with Python (though I don't know how much compression can help here). This is what we have in pysem now: we zip the Concepticon data and unzip it when loading the library, which I find fine for this purpose and its performance. But that would mean we compute, for example, these similarities for all Concepticon entries in Spanish. The link to Concepticon can later be inferred or also stored along with the data. BTW: we can then also compare with pysem's STARLING similarities, which are rarely used but interesting. |
Some words for concepts, and also the concepts themselves with their English glosses, are multi-word expressions. We wouldn't look these up directly, but rather add together the corresponding word embeddings (dropping function words such as 'the', 'is', ...), as long as there is no negation as part of the phrase. A negation would result in a subtraction instead of an addition. This would need a bit of special handling, since it is not part of the embeddings themselves. |
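A sketch of this compositional handling, assuming gensim-style KeyedVectors; the stop-word and negation lists are illustrative only:

```python
import numpy as np

STOP_WORDS = {"the", "a", "an", "is", "of", "to"}   # illustrative only
NEGATIONS = {"not", "no", "never", "without"}        # illustrative only

def phrase_vector(phrase, vectors):
    """Compose a vector for a multi-word gloss by adding word embeddings,
    skipping function words and subtracting everything after a negation."""
    result = np.zeros(vectors.vector_size)
    sign = 1.0
    for token in phrase.lower().split():
        if token in NEGATIONS:
            sign = -1.0           # subsequent content words get subtracted
            continue
        if token in STOP_WORDS or token not in vectors:
            continue
        result += sign * vectors[token]
    return result
```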
This just warrants our special treatment, as we then use these metrics to provide concept-based similarities through a given language. I guess that even makes it nicer.
|
Possible use case of a Peruvian language with possible Spanish borrowings:
OK, enough for today on this!!
Maybe a bit redundant from my comment above:
|
OK, I've been trying to go directly to a nice solution here for our similarity function, similar to what @LinguList describes, but maybe it's better if I try it in a few steps. Just in case I am going down a garden path to nowhere, this will make sure I don't get too lost! So this week at least I'll have 1 or 2 iterations on developing such measures. |
Prototype - used the English wordlist from WOLD (n=1480) and embeddings from Word2Vec. With a larger wordlist, space and time increase too, approximately as the square of the wordlist size. For a vocabulary of 1,500, it's about 25MB for all the similarities. But if we just set a threshold at, say, the top 100 similarities, then we are back to about 25MB again! Multi-word entries were broken up into separate words and their embeddings added together, since Word2Vec provides only single-word embeddings. This seems OK in general.
More earthy expressions seem OK too.
Sample of report on top 3 similarities:
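A rough sketch of how such a top-3 report could be produced, assuming gensim KeyedVectors; the file names are placeholders, not the actual prototype code:

```python
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("english_vectors.bin", binary=True)
with open("wold_english.txt") as handle:
    wordlist = [w.strip() for w in handle if w.strip() in vectors]

def top_k(word, k=3):
    """Return the k most similar wordlist entries for the given word."""
    scored = [(other, float(vectors.similarity(word, other)))
              for other in wordlist if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

for word in wordlist[:10]:
    print(word, top_k(word))
```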
|
Nice start. I am traveling-teaching this week, but when I find time and
get back, I should share some info on a bachelor project I supervised,
where we discussed various metrics and tried to compare them. Not
necessarily all successful, but we could build a bit on that.
It could be an entire study on comparing these similarities, as this has
not often been done so far, as far as I know.
Now we should decide on a common format (I propose to use zipped JSON
files, some dictionary structure for similarities) and see how we can
put this into a simple function.
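A minimal sketch of that format, assuming a plain dictionary of pairwise similarities stored as a zipped JSON file; the file and member names, as well as the example entries, are placeholders:

```python
import json
import zipfile

def write_similarities(similarities, path="similarities.zip", member="similarities.json"):
    with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(member, json.dumps(similarities))

def read_similarities(path="similarities.zip", member="similarities.json"):
    with zipfile.ZipFile(path) as zf:
        return json.loads(zf.read(member))

# Illustrative entries only.
write_similarities({"hand": {"arm": 0.71, "god": 0.12}})
print(read_similarities())
```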
In parallel, one can check how well these similarities integrate into an
SVM in our borrowing detection approach.
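A minimal sketch of plugging a semantic similarity score into an SVM-based classifier, assuming scikit-learn; the features and training data are purely illustrative placeholders:

```python
from sklearn.svm import SVC

# Each row: [phonetic similarity, semantic similarity]; label 1 = borrowing.
X_train = [[0.9, 0.8], [0.2, 0.1], [0.85, 0.3], [0.1, 0.7]]
y_train = [1, 0, 1, 0]

classifier = SVC(kernel="rbf", probability=True)
classifier.fit(X_train, y_train)

# Score a new candidate pair by its two similarity features.
print(classifier.predict_proba([[0.8, 0.75]]))
```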
|
We should add some of these in a data/ folder for now, with additional information and code there. The resulting network file should then be accessible from within the pysem library.
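A rough sketch of how a zipped data file under data/ could be read from within the installed package, assuming importlib.resources; the package layout and file names are placeholders:

```python
import json
import zipfile
from importlib import resources

def load_similarities(name="similarities.zip", member="similarities.json"):
    """Read a zipped JSON file shipped inside the installed package."""
    with (resources.files("pysem") / "data" / name).open("rb") as handle:
        with zipfile.ZipFile(handle) as zf:
            return json.loads(zf.read(member))
```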