Testing unknown strings throws Key Error #1

sebader · 2018-06-21T07:19:15Z

First of all thanks a lot @jgarciab for your work and sharing it, great tutorial!
I've implemented all your steps on an Azure Databricks Cluster with only a few modifications and successfully trained it against my own data set.
My issue now is that I cannot run the final step to test the similarity of two strings which were in neither the test set nor the training set:

str1 = "Hello World"
str2 = "Hallo Welt"
distances = find_distances(str1, str2)
clf.decision_function(np.array(distances,dtype=float))

throws:

Key Error at...
<command-3912256500001942> in cosineWords(a, b, dictTrain, tfidf_matrix_train)
     76     """
     77     ind_a = dictTrain[a.lower().rstrip()]
---> 78     ind_b = dictTrain[b.lower().rstrip()]
     79     score = cosine_similarity(tfidf_matrix_train[ind_a:ind_a+1], tfidf_matrix_train[ind_b:ind_b+1])
     80     return score

If str1 or str2 were not in the training set, they are obviously not in the dictTrain dictionary either, so the lookup fails.
Did I misunderstand something here or is this a bug?

Thanks!


sebader · 2018-06-21T07:29:37Z

Oh, writing about it often helps to understand it :) I just realized that dictTrain is in fact the combined list of all strings in database1 and database2.
So my question is a little different: how can I use this to match new strings which are not part of either db1 or db2? Or is the solution specific to two known and fixed data sets?
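
To make the failure mode concrete, here is a minimal sketch of the lookup that cosineWords performs. The dictionary contents and the helper name lookup_index are hypothetical stand-ins, not code from the tutorial; only the `.lower().rstrip()` normalization is taken from the traceback above.

```python
# Hypothetical stand-in for dictTrain: maps each known string to its row
# index in the TF-IDF matrix. The real one is built from database1 and database2.
dict_train = {"hello world": 0, "hallo welt": 1}


def lookup_index(text, index_by_string):
    """Return the row index for text, or None if the string was never indexed."""
    key = text.lower().rstrip()  # same normalization as in cosineWords
    # dict.get returns None instead of raising KeyError for unknown keys
    return index_by_string.get(key)


known = lookup_index("Hello World ", dict_train)        # string seen at indexing time
unknown = lookup_index("Bonjour le monde", dict_train)  # string never indexed
```

A guard like this only turns the KeyError into an explicit "unknown string" case. It does not give a similarity score for strings outside both databases.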

jgarciab · 2018-06-21T18:06:35Z

Hi @sebader, what exactly are you trying to do?

The solution merges two arbitrary datasets ("./D/database_1.csv" and "./D/database_2.csv" in the example). To do that, it uses a training set ("./D/test.csv").

If you want to merge three datasets (db1,db2,db3) you need to decide which one is your base case (db1 or db2) and merge db1_db2 and db1_db3, or merge them and use the combination as the base case (combine the merged db1_db2 with db3).

sebader · 2018-06-21T20:35:14Z

The problem I'm trying to solve is basically the same: two different data sources of strings (descriptions of dishes from a menu). I already have a data set from both sources, which I used for training. But new data keeps flowing into both data sources daily, and I would like to use the solution to match those as well. Or would I have to retrain the system every day?!

jgarciab · 2018-06-21T20:41:11Z

I'm afraid you'd need to retrain it every day, but you should be able to adapt the code to avoid that.

I don't have time right now to update the code making it more flexible, but maybe at some point this summer.
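
One possible way to adapt it, as a rough sketch only: keep the fitted TF-IDF vectorizer around and transform new strings on the fly instead of looking up precomputed rows through dictTrain. This assumes the TF-IDF matrix comes from scikit-learn's TfidfVectorizer. The corpus, the character n-gram settings, and the function name cosine_new_strings are illustrative assumptions, not the repository's actual code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Assumed corpus: in the tutorial this would be every string from both databases.
corpus = ["hello world", "hallo welt", "good morning", "guten morgen"]

# Fit once and keep the fitted vectorizer; the analyzer settings are an assumption.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
vectorizer.fit(corpus)


def cosine_new_strings(a, b, fitted_vectorizer):
    """Cosine similarity of two strings that need not appear in the corpus."""
    # transform() reuses the fitted vocabulary and IDF weights; n-grams that
    # were never seen during fitting are silently ignored.
    vectors = fitted_vectorizer.transform([a.lower().rstrip(), b.lower().rstrip()])
    return float(cosine_similarity(vectors[0:1], vectors[1:2])[0, 0])


score_same = cosine_new_strings("Hello Welt", "hello welt", vectorizer)
score_diff = cosine_new_strings("Hello Welt", "Hallo World", vectorizer)
```

The trade-off is that the vocabulary and IDF weights stay frozen at fitting time, so an occasional retrain would still be needed as the daily data drifts, just not for every new string.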

sebader · 2018-06-21T20:43:15Z

ok got it. Maybe I'll try to figure it out in the meantime but I'll watch your repo as well :)
Thanks a lot for your work in any case!
