Testing unknown strings throws Key Error #1

Open
sebader opened this issue Jun 21, 2018 · 5 comments

sebader commented Jun 21, 2018

First of all thanks a lot @jgarciab for your work and sharing it, great tutorial!
I've implemented all your steps on an Azure Databricks Cluster with only a few modifications and successfully trained it against my own data set.
My issue now is that I cannot run the final step to test the similarity of two strings that were in neither the test set nor the training set:

str1 = "Hello World"
str2 = "Hallo Welt"
distances = find_distances(str1, str2)
clf.decision_function(np.array(distances,dtype=float))

throws:

Key Error at...
<command-3912256500001942> in cosineWords(a, b, dictTrain, tfidf_matrix_train)
     76     """
     77     ind_a = dictTrain[a.lower().rstrip()]
---> 78     ind_b = dictTrain[b.lower().rstrip()]
     79     score = cosine_similarity(tfidf_matrix_train[ind_a:ind_a+1], tfidf_matrix_train[ind_b:ind_b+1])
     80     return score

If str1 or str2 were not in the training set, they are obviously not in dictTrain either, so the lookup fails.
Did I misunderstand something here or is this a bug?

Thanks!
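
For readers hitting the same error, here is a minimal sketch of how to check which inputs will fail before calling the similarity code. It assumes `dictTrain` behaves like a plain dict that maps normalized strings to row indices, as the traceback suggests. The names `dict_train`, `normalize`, and `missing_keys` are hypothetical and not part of the tutorial code.

```python
# Hypothetical stand-in for dictTrain: normalized string -> row index.
dict_train = {"hello world": 0, "hallo welt": 1}


def normalize(s):
    """Apply the same normalization the traceback shows before the lookup."""
    return s.lower().rstrip()


def missing_keys(strings, lookup):
    """Return the input strings whose normalized form is absent from lookup."""
    return [s for s in strings if normalize(s) not in lookup]


# "Hello World " normalizes to a known key; the French string does not.
unknown = missing_keys(["Hello World ", "Bonjour le monde"], dict_train)
print(unknown)
```

Any string reported here would raise the KeyError shown above if passed to `cosineWords`.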

Author

sebader commented Jun 21, 2018

Oh, writing about it often helps to understand it :) I just realized that dictTrain is in fact the combined list of all strings in database1 and database2.
So my question is a little different: how can I use this to match new strings that are not part of either db1 or db2? Or is the solution specific to two known, fixed data sets?

@jgarciab
Owner

Hi @sebader, what exactly are you trying to do?

The solution merges two arbitrary datasets ("./D/database_1.csv" and "./D/database_2.csv" in the example). For that it uses a training set ("./D/test.csv").

If you want to merge three datasets (db1, db2, db3), you need to either pick one as your base case (db1 or db2) and merge db1_db2 and db1_db3, or merge two of them first and use the result as the base case (combine the merged db1_db2 with db3).

Author

sebader commented Jun 21, 2018

The problem I'm trying to solve is basically the same: two different data sources of strings (descriptions of dishes from a menu). I already have a data set from both sources, which I used for training. But new data keeps flowing into both sources daily, and I would like to use the solution to match those strings as well. Or would I have to retrain the system every day?!

@jgarciab
Owner

I'm afraid you'd need to retrain it every day, but you should be able to adapt the code to avoid that.

I don't have time right now to update the code to make it more flexible, but maybe at some point this summer.
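
One possible adaptation, as a rough sketch rather than the repo's actual implementation: keep a fitted scikit-learn `TfidfVectorizer` and call `transform` on incoming strings instead of looking up row indices in `dictTrain`. The function name `cosine_words_unseen`, the sample strings, and the character n-gram settings are all assumptions made for illustration; the tutorial's real vectorizer settings may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for the combined strings of database 1 and database 2.
known_strings = [
    "grilled chicken salad",
    "chicken salad grilled",
    "tomato soup",
    "soup of tomato",
]

# Assumption: character n-grams. Fit once on the known strings and keep the object.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
vectorizer.fit(known_strings)


def cosine_words_unseen(a, b, vectorizer):
    """Cosine similarity of two arbitrary strings using an already fitted vectorizer."""
    # transform() maps any string into the fitted vocabulary, so no dict lookup is needed.
    vecs = vectorizer.transform([a.lower().rstrip(), b.lower().rstrip()])
    return float(cosine_similarity(vecs[0:1], vecs[1:2])[0, 0])


# Neither string appears in known_strings, yet no KeyError is raised.
score = cosine_words_unseen("Grilled chicken", "Chicken, grilled", vectorizer)
print(score)
```

N-grams that were never seen during fitting are ignored by `transform`, so the vocabulary and IDF weights slowly go stale as new data arrives. Refitting occasionally would likely still be worthwhile, just not necessarily every day. Any other distance features fed to `clf` would need the same treatment.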

Author

sebader commented Jun 21, 2018

ok got it. Maybe I'll try to figure it out in the meantime but I'll watch your repo as well :)
Thanks a lot for your work in any case!
