Normal sentence data is biased #6

Open
DanielRapp opened this issue Jan 12, 2012 · 3 comments

Comments

@DanielRapp
Owner

The data for normal sentences (collected from fmylife.com) is biased towards the word "was" (and probably a lot of other things). A good source of new normal sentences might be Wikipedia.

@entaroadun

Wikipedia will not work either. You need to take sentences from Twitter that are not followed by TWSS. Positive and negative examples have to be at least a little bit similar, otherwise you are creating a classifier that distinguishes between sentences from Twitter and sentences from Wikipedia.

@DanielRapp
Owner Author

Sentences that were and weren't replied to with "twss" or "that's what she said" on Twitter have been collected. But the positive sentences are way too noisy to be usable.

Right now the sentences that can be replied with "that's what she said" all come from http://twssstories.com/

@entaroadun

How about creating a classifier that distinguishes between FML (http://fmylife.com) and TWSS (http://twssstories.com/) stories? Both sources have well-curated data if you use only those examples above a certain "like" or "I agree, your life sucks" + "you deserved it" threshold. Or you could simply use the best 2000 examples from each website.
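A rough sketch of that selection step, in case it helps. The `Story` type, its `votes` field, and the `select_curated` helper are all made up for illustration; how the vote counts are actually scraped from either site is not covered here, and "above a threshold" is read as "at or above".

```python
from typing import List, NamedTuple, Optional


class Story(NamedTuple):
    text: str
    votes: int  # hypothetical field: "likes", or "I agree" + "you deserved it"


def select_curated(stories: List[Story],
                   min_votes: Optional[int] = None,
                   top_n: Optional[int] = None) -> List[Story]:
    """Keep stories with at least min_votes, then the top_n by vote count."""
    kept = [s for s in stories if min_votes is None or s.votes >= min_votes]
    kept.sort(key=lambda s: s.votes, reverse=True)  # most-voted first
    return kept if top_n is None else kept[:top_n]
```

Something like `select_curated(fml_stories, top_n=2000)` would then give the "best 2000" variant, and `select_curated(fml_stories, min_votes=...)` the threshold variant.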

Normalize all vectors to unit length [ x := x/sqrt(sum(x*x)) ]. This way, for any two sentences the scalar product will be between 0 and 1, provided the features are non-negative (e.g. word counts).
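For what it's worth, here is a minimal sketch of the normalization, assuming a plain bag-of-words representation with whitespace tokenization (the thread does not say which features the classifier uses, so `bag_of_words`, `normalize`, and `dot` are illustrative names only).

```python
import math
from collections import Counter


def bag_of_words(sentence):
    """Word counts for a sentence (assumed feature representation)."""
    return Counter(sentence.lower().split())


def normalize(vec):
    """Scale a sparse vector to unit length: x := x / sqrt(sum(x*x))."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    if norm == 0:
        return dict(vec)  # empty sentence: nothing to scale
    return {word: v / norm for word, v in vec.items()}


def dot(a, b):
    """Scalar product of two sparse vectors stored as dicts."""
    return sum(v * b.get(word, 0.0) for word, v in a.items())
```

With counts as features every component is non-negative, so the scalar product of two normalized vectors is their cosine similarity and cannot go below 0.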

For any new sentence, check its 20 nearest neighbors. If the numbers of FML and TWSS neighbors are similar (FML ~ TWSS), the sentence is neither FML nor TWSS; if FML >> TWSS, label it FML; if TWSS >> FML, label it TWSS.
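The neighbor vote could look roughly like the sketch below. It assumes word-count vectors scaled to unit length and a brute-force search over all labeled examples. The `margin` parameter is an assumption: nothing above pins down how much bigger ">>" has to be, so here a label is only assigned when the two vote counts differ by at least that fraction of the neighbors.

```python
import math
from collections import Counter


def unit_vector(sentence):
    """Word counts scaled to unit length (assumed features)."""
    counts = Counter(sentence.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()} if norm else {}


def similarity(a, b):
    """Scalar product of two sparse unit vectors."""
    return sum(v * b.get(w, 0.0) for w, v in a.items())


def classify(sentence, labeled, k=20, margin=0.3):
    """labeled: list of (text, label) pairs, label being "FML" or "TWSS"."""
    query = unit_vector(sentence)
    # Brute force: rank every labeled example by similarity to the query.
    ranked = sorted(labeled,
                    key=lambda item: similarity(query, unit_vector(item[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    fml, twss = votes["FML"], votes["TWSS"]
    total = fml + twss
    # Similar counts (FML ~ TWSS) mean the sentence is neither.
    if total == 0 or abs(fml - twss) / total < margin:
        return "neither"
    return "FML" if fml > twss else "TWSS"
```

Recomputing every vector on each query is wasteful for a few thousand examples per class; caching the unit vectors once would be the obvious first improvement.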

The data will be noisy by its nature. If you get accuracy > 0.6 you can stop optimizing your algorithm.


2 participants