Testing unknown strings throws Key Error #1
Comments
Oh, often writing about it helps to understand it :) Just realized that dictTrain is in fact the combined list of all strings in database1 and database2.
Hi @sebader, what exactly are you trying to do? The solution merges two arbitrary datasets ("./D/database_1.csv" and "./D/database_2.csv" in the example). For that it uses a training set ("./D/test.csv"). If you want to merge three datasets (db1, db2, db3), you need to decide which one is your base case (db1 or db2) and merge db1_db2 and db1_db3, or merge two of them first and use the combination as the base case (combine the merged db1_db2 with db3).
The problem I'm trying to solve is basically the same: two different data sources of strings (descriptions of dishes from a menu). I already have a data set from both sources, which I used for training. But new data keeps flowing into both sources daily, and I would like to use the solution to match those strings as well. Or would I have to retrain the system every day?
I'm afraid you'd need to retrain it every day, but you should be able to adapt the code to avoid that. I don't have time right now to update the code to make it more flexible, but maybe at some point this summer.
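One possible shape for such an adaptation, as a minimal sketch and not the repository's actual code: it assumes `dictTrain` is a plain Python dict keyed by the raw string, and that whatever is stored per string can be recomputed on demand. The helper names `char_ngrams` and `lookup_or_featurize` are hypothetical, and the character n-gram featurizer is only a stand-in for the real preprocessing step.

```python
from typing import Callable, Dict, List


def char_ngrams(text: str, n: int = 3) -> List[str]:
    """Hypothetical featurizer: character n-grams of a padded, lower-cased string."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def lookup_or_featurize(
    text: str,
    dict_train: Dict[str, List[str]],
    featurize: Callable[[str], List[str]] = char_ngrams,
) -> List[str]:
    """Return the stored entry for `text`, computing and caching it if unseen.

    This avoids the KeyError raised by `dict_train[text]` for strings that
    were in neither the training nor the test set.
    """
    try:
        return dict_train[text]
    except KeyError:
        features = featurize(text)
        dict_train[text] = features  # cache so later lookups succeed
        return features


# Usage: the dict is built from known strings, then queried with a new one.
dict_train = {"pizza margherita": char_ngrams("pizza margherita")}
new_features = lookup_or_featurize("pasta carbonara", dict_train)
```

Whether a fallback like this is enough depends on how the trained model consumes the entries; if the stored values are tied to a vocabulary fixed at training time, unseen strings may still need that vocabulary to be extended or the model to be retrained.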
OK, got it. Maybe I'll try to figure it out in the meantime, but I'll watch your repo as well :)
First of all, thanks a lot @jgarciab for your work and for sharing it. Great tutorial!
I've implemented all your steps on an Azure Databricks Cluster with only a few modifications and successfully trained it against my own data set.
My issue now is that I cannot run the final step to test the similarity of two strings that were in neither the test nor the training set:
throws:
If str1 or str2 were not in the training set, they are obviously not in the dictTrain array.
Did I misunderstand something here or is this a bug?
Thanks!