Normal sentence data is biased #6

DanielRapp · 2012-01-12T21:16:39Z

The data for normal sentences (collected from fmylife.com) is biased towards the word "was" (and probably a lot of other things). A good resource for new normal sentences may be Wikipedia.

entaroadun · 2012-01-12T21:23:11Z

Wikipedia will not work either. You need to take sentences from Twitter that are not followed by TWSS. Positive and negative examples have to be at least a little bit similar; otherwise you are creating a classifier that distinguishes between sentences from Twitter and sentences from Wikipedia.

DanielRapp · 2012-01-12T21:42:53Z

Sentences that were and weren't replied to with "twss" or "that's what she said" on Twitter have been collected, but the positive sentences are way too noisy to be usable.

Right now the sentences that can be replied with "that's what she said" all come from http://twssstories.com/

entaroadun · 2012-01-12T22:50:50Z

How about creating a classifier that distinguishes between FML (http://fmylife.com) and TWSS (http://twssstories.com/) stories? Both sources have well-curated data if you use only examples above a certain "like" or "I agree, your life sucks" + "you deserved it" threshold. Or you could simply use the best 2000 examples from each website.

Normalize all vectors to unit length [ x := x/sqrt(sum(x*x)) ]. This way, for any two sentences the scalar product will be between 0 and 1 (given non-negative features such as word counts).
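A minimal Python sketch of that normalization, assuming each sentence is already represented as a vector of non-negative word counts. The helper names `normalize` and `dot` and the example vectors are made up for illustration; they are not taken from the project.

```python
import math

def normalize(vec):
    """Scale a vector to unit Euclidean length: x := x / sqrt(sum(x*x)).

    An all-zero vector is returned unchanged to avoid dividing by zero.
    """
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]

def dot(a, b):
    """Scalar product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical word-count vectors over a shared 4-word vocabulary.
a = normalize([1, 2, 0, 1])
b = normalize([0, 1, 1, 1])

# After normalization the scalar product is the cosine similarity.
similarity = dot(a, b)
```

With non-negative counts the similarity cannot go below 0, and a sentence compared with itself scores 1.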

For any new sentence, check its 20 nearest neighbors. If the numbers of FML and TWSS neighbors are similar (FML ~ TWSS), the sentence is neither FML nor TWSS. If FML >> TWSS, label it FML; if TWSS >> FML, label it TWSS.
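The neighbor vote could be sketched roughly like this in Python. The comment does not say how large the gap must be for ">>", so the `margin` parameter (one label needs at least `margin` times as many neighbors as the other) is an assumption, as are the function names and the toy vectors.

```python
import math

def normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors stay as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def dot(a, b):
    """Scalar product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def classify(query, labeled, k=20, margin=2.0):
    """Label a sentence vector by voting among its k nearest neighbors.

    labeled: list of (vector, label) pairs, label being "FML" or "TWSS".
    margin:  hypothetical stand-in for ">>"; the winner needs at least
             margin times as many neighbors as the loser.
    Returns "FML", "TWSS", or None when neither label clearly dominates.
    """
    q = normalize(query)
    # Rank by scalar product of unit vectors, most similar first.
    ranked = sorted(
        ((dot(q, normalize(vec)), label) for vec, label in labeled),
        key=lambda pair: pair[0],
        reverse=True,
    )
    nearest = ranked[:k]
    fml = sum(1 for _, label in nearest if label == "FML")
    twss = len(nearest) - fml
    if fml > twss and fml >= margin * twss:
        return "FML"
    if twss > fml and twss >= margin * fml:
        return "TWSS"
    return None

# Toy data over a 3-word vocabulary; real use would keep k=20.
examples = [
    ([1, 0, 0], "FML"),
    ([1, 1, 0], "FML"),
    ([0, 0, 1], "TWSS"),
    ([0, 1, 1], "TWSS"),
    ([0, 0, 2], "TWSS"),
]
label_a = classify([0, 0, 1], examples, k=3)
label_b = classify([1, 0, 0], examples, k=3)
label_c = classify([1, 1], [([1, 0], "FML"), ([0, 1], "TWSS")], k=2)
```

Returning `None` for the balanced case keeps "neither FML nor TWSS" as an explicit third outcome instead of forcing a label.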

The data will be noisy by its nature. If you get accuracy > 0.6 you can stop optimizing your algorithm.
