Normal sentence data is biased #6
Wikipedia will not work either. You need to take sentences from Twitter that are not followed by TWSS. Positive and negative examples have to be at least a little bit similar; otherwise you are creating a classifier that distinguishes sentences from Twitter from sentences from Wikipedia.
Sentences that were and weren't replied to with "twss" or "that's what she said" on Twitter have been collected, but the positive sentences are way too noisy to be usable. Right now the sentences that can be replied to with "that's what she said" all come from http://twssstories.com/
How about a classifier that distinguishes between FML (http://fmylife.com) and TWSS (http://twssstories.com/) stories? Both sources have well-curated data if you use only the examples above a certain "like" or "I agree, your life sucks" + "you deserved it" threshold, or simply take the best 2000 examples from each website. Normalize all vectors to unit length [ x := x/sqrt(sum(x*x)) ]; this way the scalar product of any two sentences will be between 0 and 1. For any new sentence, check its 20 nearest neighbors: if the numbers of FML and TWSS neighbors are similar, it is neither FML nor TWSS; if FML >> TWSS, label it FML; if TWSS >> FML, label it TWSS. The data will be noisy by its nature, so if you get accuracy > 0.6 you can stop optimizing your algorithm.
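A minimal sketch of the scheme above, assuming sentences have already been turned into non-negative feature vectors (e.g. term counts); the function names, the `margin` parameter for "similar counts", and the toy data are my own illustration, not anything from this repo:

```python
import numpy as np

def normalize(X):
    # Scale each row to unit length: x := x / sqrt(sum(x*x)).
    norms = np.sqrt((X * X).sum(axis=1, keepdims=True))
    return X / np.where(norms == 0, 1, norms)

def knn_label(query, fml_vecs, twss_vecs, k=20, margin=3):
    # With unit-length, non-negative vectors the dot product is a
    # cosine similarity in [0, 1].
    sims = np.concatenate([fml_vecs @ query, twss_vecs @ query])
    labels = np.array(["FML"] * len(fml_vecs) + ["TWSS"] * len(twss_vecs))
    top = labels[np.argsort(sims)[-k:]]          # k nearest neighbors
    n_fml = int((top == "FML").sum())
    n_twss = k - n_fml
    # Similar counts -> neither class; otherwise the dominant class wins.
    if abs(n_fml - n_twss) < margin:
        return "NEITHER"
    return "FML" if n_fml > n_twss else "TWSS"

# Toy usage: two well-separated clusters standing in for the two corpora.
rng = np.random.default_rng(0)
fml = normalize(np.abs(rng.normal([1.0, 0.05], 0.02, size=(20, 2))))
twss = normalize(np.abs(rng.normal([0.05, 1.0], 0.02, size=(20, 2))))
query = normalize(np.array([[1.0, 0.0]]))[0]
print(knn_label(query, fml, twss))
```

The `margin` threshold is one way to make "number of FML ~ TWSS neighbors" concrete; any tie-breaking rule on the vote counts would do.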
The data for normal sentences (collected from fmylife.com) is biased towards the word "was" (and probably a lot of other things). A good resource for new normal sentences might be Wikipedia.