
Similarity Metrics based on Word2Vec for Concept Lists in Different Languages #10

Open
LinguList opened this issue Aug 17, 2022 · 12 comments

Comments

@LinguList (Contributor)

We should add some of these in a folder data/ for now, with additional information and code there. The resulting network file should then be accessible from within the pysem library.

@LinguList (Contributor, Author)

There is this collection, which might be helpful: https://www.marekrei.com/projects/vectorsets/

@LinguList (Contributor, Author)

And we don't need to bother with Spanish word2vec yet; we only want to see whether this works, so we can start with English.

@fractaldragonflies (Collaborator) commented Aug 24, 2022

So the scope here is really just to develop a few similarity metrics and demonstrate their use and usefulness, with an eye to integrating a similarity measure into our larger project of cognate matching and borrowing detection. Sorry, I had lost sight of this for a moment!

We can start with word2vec in English (as @LinguList suggested) and consider GloVe or fastText among the more recent vector sets if this seems promising [or even start with fastText]. An advantage of fastText is that it works at the subword level and so might be more forgiving of small discrepancies in word form.

I previously developed a simple class with functions for finding the closest match and performing vector operations to emulate the analogical relationships noted by Mikolov in his original papers. Although this was for similarity of IPA segments in my phonology research project, it at least gives me a starting point.

Here are word2vec (https://www.tensorflow.org/tutorials/text/word2vec) and fastText (https://fasttext.cc/docs/en/support.html) references. I like the API provided by fastText, with some functions already defined, but its 4.5GB download for English seems excessive compared to the 127MB English representation distributed by Marekrei (above).
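For concreteness, here is a rough sketch of what such "closest match" and vector-analogy operations could look like over a plain {word: vector} dictionary (names and structure are illustrative only, not the actual class from the phonology project):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def closest(query_vec, embeddings, exclude=(), n=3):
    """Return the n words whose vectors are most similar to query_vec."""
    scored = [(word, cosine(query_vec, vec))
              for word, vec in embeddings.items() if word not in exclude]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:n]

def analogy(a, b, c, embeddings, n=3):
    """Solve a : b :: c : ? via vec(b) - vec(a) + vec(c), Mikolov-style."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    return closest(target, embeddings, exclude={a, b, c}, n=n)
```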

@LinguList (Contributor, Author)

Hm, I am wondering how to start. What seems best is to show, in an examples folder here, how a matrix of pairwise similarities between n words for a given language can be computed, with instructions for downloading the data locally and a requirements.txt file. We can then apply this to some master word file that we automatically extract from Concepticon at a certain point for a certain language, or even to some other master list (pysem does not depend on Concepticon). We can then discuss how to integrate the data by making a zip file and accessing the matrix, so one would have a lookup function that looks, e.g., like this:

from pysem.fasttext import similarity
print(similarity("mano", "dio", "Spanish"))

Of course, one may think of different "Spanish" forms, but we can discuss this later. In the long run, this may contribute to norare, but we need to start with some examples for now.

@fractaldragonflies (Collaborator)

If our product is for use by a semantic matching program with a limited vocabulary, then we could use any of the available embeddings in Spanish or Portuguese (staying with English for prototyping) to develop a resource of semantic distances between words, without retaining the embeddings themselves.

  • FastText is better for inexact matches of words, since it uses subword units to build its embeddings, but its models are maybe a factor of 10 larger. If the embeddings are not retained after computing the distances, that is not a problem.
  • We could experiment with both Word2Vec and FastText.
  • The algorithm for calculating distances would be to load the embeddings into NumPy and compute, as a vector calculation, the dot products or cosine similarities between all words in our wordlist (see the sketch below).
    • Assumption: our wordlist is on the order of thousands, not tens of thousands or more.
    • We can calculate both dot-product and cosine similarity measures; the dot product still retains something of the frequency information.
  • The resulting data structure would be:
    • a bidirectional dictionary (like we use for segments) between words and indices into the similarity store;
    • either a pairwise matrix of similarities (triangular, to save space if it gets really big) in which (x, y) indexes a similarity measure
      • for 3,000 words, a triangular similarity matrix of about 4.5M entries;
    • or a sparse dictionary in which only similarities > S are retained, as an ordered dictionary of (idx, sim) pairs
      • for 3,000 words, limited to the 50 most similar words (phrases), a dictionary of about 0.3M entries.
    • The sparse structure might be both more space- and time-efficient than the full matrix, as long as 50 (or so) candidate words is enough.

Even if we use something big like FastText, as long as we can limit our vocabulary to something like the sizes given above, we could have an efficient similarity measure.
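A minimal sketch of the matrix computation described in the list above, assuming the embeddings for our wordlist have already been loaded into an (n_words, dim) NumPy array (all names are illustrative):

```python
import numpy as np

def similarity_matrix(vectors):
    """Pairwise cosine similarities between all rows of `vectors`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T            # (n_words, n_words) cosine matrix

def dot_matrix(vectors):
    """Raw dot products, which still retain some frequency information."""
    return vectors @ vectors.T

# Bidirectional lookup between words and matrix indices:
# word2idx = {w: i for i, w in enumerate(wordlist)}
# idx2word = {i: w for w, i in word2idx.items()}
```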

@LinguList (Contributor, Author)

I am okay with testing several variants. We can later also zip the data and unzip it with Python (though I don't know how much compression will help here). This is what we have in pysem now: we zip the Concepticon data and unzip it when loading the library, which I find fine for this purpose and for performance.

But that would mean we compute, for example, these similarities for all Concepticon entries in Spanish. The link to Concepticon can later be inferred or also stored alongside the data.
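Purely as a sketch, one way the zip-and-load idea could look is to store the similarity matrix and the word index in a compressed NumPy archive and load it when the library needs it (file and array names here are made up for illustration):

```python
import numpy as np

# At build time: store the matrix and the word index together, compressed.
# np.savez_compressed("spanish-sims.npz", sims=sim_matrix, words=np.array(wordlist))

# At load time, inside the library:
data = np.load("spanish-sims.npz", allow_pickle=False)
sims, words = data["sims"], [str(w) for w in data["words"]]
word2idx = {w: i for i, w in enumerate(words)}

def similarity(a, b):
    return float(sims[word2idx[a], word2idx[b]])
```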

BTW: we can then also compare with pysem's STARLING similarities, which are rarely used but interesting.

@fractaldragonflies (Collaborator)

Some words for concepts, and also the concepts themselves with their English glosses, are multi-word expressions. We wouldn't look these up directly, but rather add together the corresponding word embeddings (dropping function words such as 'the', 'is', ...), as long as there is no negation in the phrase; a negation would result in a subtraction instead of an addition.

This would require a bit of special handling, since it is not part of the embeddings themselves.
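A hedged sketch of that composition step, with illustrative (not exhaustive) stop-word and negation lists, assuming the embeddings are NumPy vectors keyed by word:

```python
STOP = {"the", "a", "an", "is", "be", "of", "to"}   # illustrative only
NEGATION = {"not", "no", "without"}                 # illustrative only

def phrase_vector(phrase, embeddings):
    """Compose an embedding for a multi-word gloss by summing word vectors."""
    vec, sign = None, 1.0
    for token in phrase.lower().split():
        if token in STOP:
            continue
        if token in NEGATION:
            sign = -1.0                  # negated part gets subtracted
            continue
        if token not in embeddings:
            continue                     # skip out-of-vocabulary tokens
        contribution = sign * embeddings[token]
        vec = contribution if vec is None else vec + contribution
    return vec                           # None if no token had an embedding
```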

@LinguList (Contributor, Author) commented Aug 26, 2022 via email

@fractaldragonflies (Collaborator)

Possible use case: a Peruvian language with possible Spanish borrowings.

  1. The researcher has available a dictionary of common Spanish terms, including the Concepticon and IDS inventories. Add to that the Spanish words and phrases from the constructed bilingual dictionary.
  2. Use this to limit the search for similar words (both phonetically and semantically), so as not to deal with the massive embeddings - except for the first time, when selecting embeddings to construct the similarities.
  3. Then, for each target word in the bilingual dictionary:
    3.1 check the sound-sequence similarity of the Spanish words that have sufficient semantic similarity,
    3.2 maybe with double thresholds, or with an SVM-learned function based on some initial training set,
    3.3 using a function something like what @LinguList presented above (and copied below); a rough sketch of the double-threshold idea follows this list.
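Purely as an illustration of the double-threshold idea in 3.1-3.2 (the helper functions and threshold values are assumptions, not an existing API):

```python
def borrowing_candidates(target_form, target_gloss, donors,
                         sound_sim, semantic_sim,
                         sound_threshold=0.7, semantic_threshold=0.5):
    """donors: iterable of (form, gloss) pairs from the Spanish candidate set.

    sound_sim and semantic_sim are placeholders for whatever similarity
    functions we settle on; the thresholds are made-up values.
    """
    hits = []
    for form, gloss in donors:
        if (semantic_sim(target_gloss, gloss) >= semantic_threshold
                and sound_sim(target_form, form) >= sound_threshold):
            hits.append((form, gloss))
    return hits
```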

OK, enough for today on this!!

> Hm, I am wondering how to start. What seems best is to show, in an examples folder here, how a matrix of pairwise similarities between n words for a given language can be computed, with instructions for downloading the data locally and a requirements.txt file. We can then apply this to some master word file that we automatically extract from Concepticon at a certain point for a certain language, or even to some other master list (pysem does not depend on Concepticon). We can then discuss how to integrate the data by making a zip file and accessing the matrix, so one would have a lookup function that looks, e.g., like this:
>
> from pysem.fasttext import similarity
> print(similarity("mano", "dio", "Spanish"))
>
> Of course, one may think of different "Spanish" forms, but we can discuss this later. In the long run, this may contribute to norare, but we need to start with some examples for now.

Maybe a bit redundant with my comment above:

  • We probably don't want to compute similarities between all words in something like Word2Vec or FastText, as they index hundreds of thousands of words.
  • But a list of candidate words for a particular language (e.g., the 3,000 words from the combined Concepticon and IDS Spanish wordlists), plus the Spanish entries from a bilingual field researcher's dictionary, might be used to select a more relevant set of, say, 5,000 words.
  • With that we have only about 12 million resulting pairwise similarities, or just 500 thousand if we limit ourselves to the top 100 most similar per word (a sketch follows below). This is pretty doable.
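A small sketch of that top-k trimming step, assuming `sims` is the dense similarity matrix from the earlier sketch and `words` is the corresponding wordlist:

```python
import numpy as np

def top_k_neighbours(sims, words, k=100):
    """Keep only the k most similar other words for each word."""
    k = min(k, len(words) - 1)
    neighbours = {}
    for i, word in enumerate(words):
        row = sims[i].copy()
        row[i] = -np.inf                         # exclude the word itself
        idx = np.argpartition(row, -k)[-k:]      # unsorted top-k indices
        idx = idx[np.argsort(row[idx])[::-1]]    # sort them by similarity
        neighbours[word] = [(words[j], float(row[j])) for j in idx]
    return neighbours
```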

@fractaldragonflies (Collaborator)

OK, I've been trying to go directly to a nice solution for our similarity function, similar to what @LinguList describes, but it is probably better to proceed in a few steps. In case I am going down a garden path to nowhere, this will make sure I don't get too lost! So this week I'll have at least 1 or 2 iterations on developing such measures.

@fractaldragonflies (Collaborator)

Prototype: used the English wordlist from WOLD (n=1480) and embeddings from Word2Vec.
Only limited cleaning of the words from the wordlist... since we want this to work for other languages too.
Calculated all pairwise similarities and then reported the top similarities for all words in the wordlist.
Due to the small size of the wordlist, no problem with space or time.

With a larger wordlist, space and time increase too, approximately as the square of the wordlist size.

For a vocabulary of 1,500, it's about 25MB for all the similarities.
So going to 15,000 words would take us to 2.5GB.

But if we just set a threshold at, say, the top 100 similarities, then we are back to 25MB again!

Multi-word expressions were broken up into separate words and their embeddings added together, since Word2Vec provides single-word embeddings. This seems OK in general.

lightning bolt, [('lightning', 0.8545579398629692), ('bolt', 0.8294791950147321), ('rattle', 0.7111561418515908)]
old woman, [('old man', 0.9438529208904961), ('woman', 0.8419059624634133), ('old', 0.8200590160797763)]

More earthy expressions seem OK too.

fart, [('surprised', 0.6738966824249598), ('shit', 0.6663192026222906), ('stupid', 0.6431720347809629)]
intestines, [('pus', 0.7338402318487939), ('lung', 0.6745549328497172), ('vagina', 0.6719459557266357)]

Sample of report on top 3 similarities:

% python examples/makesimsforwl.py en --input nnet_vectors.100.txt
Embeddings vocab and vector size: b'132430 100\n'
Embeddings: vocabulary shape (132430, 100).
Size: vocab 10,485,856, embeddings 105,944,128.
Size wordlist 1,483.
Emb is None for netbag.
Emb is None for earwax.
Emb is None for goitre.
Emb is None for ridgepole.
Emb is None for fishhook.
Size: 2950, (1475, 100)
1475
fry, [('butter', 0.8042854349842384), ('mushroom', 0.7832237533109371), ('bake', 0.779592179181041)]
thousand [[numword]], [('zero [[numword]]', 0.9999999999999999), ('five [[numword]]', 0.9999999999999999), ('eleven [[numword]]', 0.9999999999999999)]
cow, [('pig', 0.8156664193500531), ('goat', 0.7868192683936832), ('sheep', 0.7684552210883713)]
fingernail, [('pubic hair', 0.8165453047136368), ('eyelash', 0.8163303040424418), ('forehead', 0.7971889532811858)]
tobacco, [('cigarette', 0.6017402928379825), ('beer', 0.5865598741226794), ('sugar cane', 0.5685409505745453)]
adze, [('spindle', 0.7199174818346881), ('chisel', 0.7089135671344956), ('scythe', 0.6888713543774874)]
same, [('all', 0.5434899643577488), ('certain', 0.5130263619592609), ('this', 0.4873846372182848)]
hour, [('day', 0.7139206736020832), ('week', 0.674579471160923), ('month', 0.6230577859429078)]
four [[numword]], [('three [[numword]]', 0.9263888505161936), ('two [[numword]]', 0.9136727320972422), ('thousand [[numword]]', 0.9087765795191391)]
rich, [('poor', 0.5997822231183239), ('beautiful', 0.5426400096267676), ('farmer', 0.5304052826318136)]
sweets, [('vegetables', 0.7424877220261447), ('oat', 0.727479115233296), ('chili', 0.6851475577781612)]
feather, [('beak', 0.7660150859872727), ('fur', 0.7457001005339128), ('grass-skirt', 0.7364235366406179)]
molar tooth, [('molar', 0.866042299130694), ('tooth', 0.7450526467968077), ('jaw', 0.6625907030130996)]
mad, [('stupid', 0.6886028664386614), ('hell', 0.5959723895311018), ('fuck', 0.5900788603563875)]
quiet, [('calm', 0.6108550346577949), ('silent', 0.5786421637006947), ('happy', 0.5343008682566188)]
row, [('square', 0.5814702629263717), ('fence', 0.5766758836803482), ('corner', 0.5570105393974787)]
stall, [('shop', 0.6325477955329423), ('shed', 0.5645848221567649), ('basket', 0.5410588588616182)]
sour, [('sweet', 0.6617518076736664), ('sweet potato', 0.6498828271965637), ('bitter', 0.6437554500148425)]
cormorant, [('gull', 0.7726610987877149), ('elk', 0.7323552200753589), ('vulture', 0.7312522635456679)]
kill, [('injure', 0.783690821448602), ('dead', 0.7546870853575245), ('attack', 0.7372081098673927)]
be silent, [('silent', 0.8290224765531314), ('alive', 0.5187246123662198), ('remain', 0.5018385311063606)]
have, [('be', 0.6024557496250549), ('that', 0.6001841380569435), ('now', 0.5584250693894303)]
bread, [('cheese', 0.913873037065727), ('butter', 0.9117029890600533), ('soup', 0.8814869916880362)]
debt, [('pay', 0.5594818992876731), ('tax', 0.5593297139708442), ('bank', 0.5555438402428177)]
citizen, [('surrender', 0.5397944548207891), ('govern', 0.5245086218522322), ('country', 0.5121272200289608)]
wall, [('roof', 0.781861785536561), ('stone', 0.744606718460037), ('brick', 0.7338536697433112)]
pay, [('earn', 0.6218248064656049), ('money', 0.6159720907476571), ('tax', 0.5776560505378086)]
horse, [('ride', 0.7273398447119427), ('dog', 0.7055156016290415), ('donkey', 0.702881748835211)]
harvest, [('barley', 0.6656591601710853), ('wheat', 0.657589181841732), ('sow', 0.6266762806539685)]
north, [('south', 0.9061287971740142), ('west', 0.8422640462874527), ('east', 0.7556908191003076)]
thresh, [('earlobe', 0.7033262878150472), ('scythe', 0.6989489765427765), ('hammock', 0.6947732468168636)]
dolphin, [('porpoise', 0.8673096237321372), ('whale', 0.8246386692137381), ('shark', 0.7339749092418)]
ride, [('horse', 0.7273398447119427), ('fly', 0.6522835865819004), ('sail', 0.6307780989824938)]
silent, [('be silent', 0.8290224765531314), ('quiet', 0.5786421637006947), ('darkness', 0.5652802961995096)]
straight, [('go down', 0.6109444386264571), ('pull', 0.6107878782156458), ('walk', 0.6066075947977537)]
dirty, [('rag', 0.6769145691305547), ('wet', 0.5831675490232032), ('bore', 0.5806651588786926)]
earn, [('pay', 0.6218248064656049), ('borrow', 0.5949876604119079), ('lend', 0.5577967991243604)]
bamboo, [('cone', 0.7425212510586381), ('tree trunk', 0.7200045010482233), ('grass', 0.7146550130433396)]

--- 8.09983491897583 seconds ---

@LinguList (Contributor, Author) commented Sep 8, 2022 via email
