Ideas for NLP pre-processing and feature engineering #14

honnibal · 2016-03-06T21:18:46Z

Hi all,

I'm excited to do some work on the text processing side of the Visual QA task. I develop the spaCy NLP library. I think we should be able to get some extra accuracy, with some extra NLP logic on the question parsing side. We'll see.

The first thing I'd like to try is mapping out of vocabulary words to similar tokens, using a word2vec model. For instance, let's say the word colour is OOV. Seems easy to map this to color.

Input: What colour is his shirt?
Tokens: ["What", "colour", "is", "his", "shirt", "?"]
Transform: ["What", "color", "is", "his", "shirt", "?"]

I think this input normalization trick is novel, but it makes sense to me for this problem. It lets you exploit pre-trained vectors without interfering with the rest of your model choices.

I think the normalization could be taken a bit further, by using the POS tagger and parser to compute context-specific keys, so that the replacement could be more exact (sense2vec). I think just the word replacement is probably okay though.

It's also easy to calculate auxiliary features with spaCy. It's easy to train a question classifier, of course. I'm not sure the model is making many errors of that type, though.

If I had to say one thing was unsatisfying about the model, I'd say it's the multiclass classification output. Have you tried having the model output a vector, and using it to find a nearest neighbour?

The text was updated successfully, but these errors were encountered:

dhruvbatra · 2016-03-07T00:40:04Z

Totally agree with the paraphrase issue! We have been brainstorming about this in the lab as well. Simply mapping OOV words to something within the vocabulary is a good place to start.

If I had to say one thing was unsatisfying about the model, I'd say it's the multiclass classification output.

I am not sure I would agree. As our paper explains, most answers in our dataset are 1-3 words long, so it really is /mostly/ a large multiclass classification problem. Our choice of 1K answers in the model is simply one convenient choice. It covers ~82% of all answers.

Have you tried having the model output a vector, and using it to find a nearest neighbour?

No, but how would such a system be trained by backprop? 1-NN isn't amenable to gradient-based learning.

honnibal · 2016-03-07T00:46:44Z

I guess I just feel like having such a small and fixed answer vocabulary makes the task a little bit more artificial. I think the coverage you observe is mostly a fact about the collection methodology, not about language in general.

Re training: Off the top of my head, maybe noise contrastive estimation? That's how the QANTA paper did it.

dhruvbatra · 2016-03-07T00:50:12Z

I guess I just feel like having such a small and fixed answer vocabulary makes the task a little bit more artificial.

:-). I would counter that just because the space of answers is small does not make the learning problem easy. Even binary questions such as "Is this person expecting company?" can require fairly heavy lifting on the vision/reasoning side.

honnibal · 2016-03-07T00:59:34Z

Hey, I'm not saying it's easy, or that it's not impressive and interesting :)

But a fixed answer vocabulary isn't the future of this task. I think the technology would take a big step towards practicality if you were learning to produce a meaning representation. That way, to learn a new answer, you just have to learn its vector. If you add another class to the model, you don't know how many weights might have to be adjusted. Probably a lot.

My hunch is that it would actually be better for accuracy, too. But, you know the evaluation much better.

dhruvbatra · 2016-03-07T01:01:18Z

Agreed.

honnibal · 2016-03-07T01:11:12Z

:)

Now, about the paraphrases. There are a couple of ways we could do this.

I could simply compute paraphrased versions of the task data, and give you a pull request with them. But then when someone ran the system with their own image/questions, you wouldn't be able to answer.
I could compute the paraphrase dictionary and check that in.
I could check in a script that generates the paraphrase dictionary.
I can bake the paraphrase mapping into spaCy's lexicon, and add a mode to prepro.py that uses spaCy with paraphrasing.
I could make prepro.py accept a word vectors file as an argument, and then it would compute the paraphrases based on the vectors.

I think 3 and 4 make the most sense.

dhruvbatra · 2016-03-07T01:15:24Z

I like 4 the best.

@jiasenlu @abhshkdz @dexter1691: Do you have a preference?

abhshkdz · 2016-03-07T01:34:01Z

I agree, 3 and 4 would be great.

dexter1691 · 2016-03-07T01:54:06Z

yeah. 3 and 4 makes sense.

honnibal · 2016-03-07T15:06:22Z

Great. I left a job running overnight compiling the paraphrases from GloVe, but I messed something up. Restarting now.

Here's some random substitutions for relatively frequent words. Some of these substitutions look good, others look pretty problematic.

u'Yeah' --> u'all'
u'fuck' --> u'what'
u'fucking' --> u'dude'
u'etc' --> u'toppers'
u'myself' --> u'herself'
u'comment' --> u'leave'
u'thanks' --> u'thank'
u'reddit' --> u'librarian'
u'yeah' --> u'all'
u'whatever' --> u'else'
u'questions' --> u'question'
u'lol' --> u'but'
u'sorry' --> u'okay'
u'hell' --> u'well'
u'Edit' --> u'again'
u'unless' --> u'except'
u'gonna' --> u'#'
u'However' --> u'even'
u'basically' --> u'apparently'
u'knew' --> u'told'
u'comments' --> u'opinion'
u'damn' --> u'dude'
u'Sorry' --> u'okay'
u'worse' --> u'better'
u'account' --> u'credit'
u'advice' --> u'help'
u'Reddit' --> u'librarian'
u'Hey' --> u'hi'
u'subreddit' --> u'#'
u'however' --> u'even'
u'kinda' --> u'seems'
u'Wow' --> u'dude'
u'certainly' --> u'definitely'
u'explain' --> u'describe'

I think it will help a lot to have vectors over POS tagged text. Then we can make sure we don't change the part-of-speech when we do the replacement.

honnibal · 2016-03-07T17:47:03Z

The instructions in the readme don't specify a frequency threshold for the training data. Is this the config you've been running in your experiments?

Below are the number of excluded words at various thresholds. With a frequency threshold of 1 or 5, there seems relatively little advantage to having a complicated replacement scheme. Only 1% of the tokens are affected. With so many other moving parts and difficulties of the task, I doubt changing the representation of those tokens would help very much.

I would be curious to try an aggressive threshold, leaving only one or two thousand words in the vocab.

Threshold 1
number of bad words: 4854/12603 = 38.51%
number of words in vocab would be 7749
number of UNKs: 4854/1537357 = 0.32%

Threshold 5
number of bad words: 8387/12603 = 66.55%
number of words in vocab would be 4216
number of UNKs: 15355/1537357 = 1.00%

Threshold 20
number of bad words: 10449/12603 = 82.91%
number of words in vocab would be 2154
number of UNKs: 37260/1537357 = 2.42%

Threshold 50
number of bad words: 11304/12603 = 89.69%
number of words in vocab would be 1299
number of UNKs: 64905/1537357 = 4.22%

Threshold 100
number of bad words: 11722/12603 = 93.01%
number of words in vocab would be 881
number of UNKs: 94764/1537357 = 6.16%

Threshold 200
number of bad words: 12073/12603 = 95.79%
number of words in vocab would be 530
number of UNKs: 144454/1537357 = 9.40%

Threshold 1000
number of bad words: 12459/12603 = 98.86%
number of words in vocab would be 144
number of UNKs: 317024/1537357 = 20.62%

jiasenlu · 2016-03-07T18:16:54Z

@honnibal I use the th = 0 (preserve all the words) in the pre-trained model. Previous I tried using only top 1000 words in our bag-of-words baseline(between th=50~100). And it seems the performance is much worse than method here[(http://arxiv.org/abs/1512.02167)], which has the similar network structure. So I doubt using aggressive threshold will improve the performance.

honnibal · 2016-03-07T18:46:33Z

I think replacing the tokens with UNK makes an aggressive threshold pretty problematic. I'm wondering whether it might work with this paraphrase replacement, though.

It seems quite difficult to learn a representation for a word occurring only once in the training data. You also don't learn any representation for the UNK token that you'll be using over the dev data. So I think th=1 seems better than th=0? That would be my guess.

Here's how the paraphrased data looks at th=50. I need to clean things up a bit before I can give you a pull request. I would say that the current results look a little promising, but we can do better.

It seems to me like crucial words of the question are often relatively rare in the data. The current paraphrase model often messes them up. But, I'm not sure how well you can train the model to learn them, on only a few examples.

What regional architecture is represented here?
 [u'what', u'center', u'design', u'is', u'represented', u'here', u'?']
Where is the cat sitting?
 [u'where', u'is', u'the', u'cat', u'sitting', u'?']
What vehicle is in the picture on the wall?
 [u'what', u'vehicle', u'is', u'in', u'the', u'picture', u'on', u'the', u'wall', u'?']
Is the television set turned on or off?
 [u'is', u'the', u'television', u'set', u'turned', u'on', u'or', u'off', u'?']
Is this a professional sport event?
 [u'is', u'this', u'a', u'professional', u'sport', u'event', u'?']
What is the man doing?
 [u'what', u'is', u'the', u'man', u'doing', u'?']
Is the player's uniform dirty?
 [u'is', u'the', u'player', u"'s", u'uniform', u'dirty', u'?']
What sole topping is shown on the pizza?
 [u'what', u'boots', u'topping', u'is', u'shown', u'on', u'the', u'pizza', u'?']
What object is this?
 [u'what', u'object', u'is', u'this', u'?']
What is the pizza sitting in?
 [u'what', u'is', u'the', u'pizza', u'sitting', u'in', u'?']
Can you see the moon here?
 [u'can', u'you', u'see', u'the', u'dark', u'here', u'?']
What is in the sky?
 [u'what', u'is', u'in', u'the', u'sky', u'?']
How does the parachute stay in the air?
 [u'how', u'does', u'the', u'military', u'keep', u'in', u'the', u'air', u'?']
Is the train moving through the countryside?
 [u'is', u'the', u'train', u'moving', u'through', u'the', u'rural', u'?']
What can be read on the train?
 [u'what', u'can', u'be', u'read', u'on', u'the', u'train', u'?']
Is this train moving?
 [u'is', u'this', u'train', u'moving', u'?']
What is the color of the dog?
 [u'what', u'is', u'the', u'color', u'of', u'the', u'dog', u'?']
Is this a library?
 [u'is', u'this', u'a', u'book', u'?']
Can the dog read?
 [u'can', u'the', u'dog', u'read', u'?']

jiasenlu · 2016-03-07T19:01:06Z

I think replacing the tokens with UNK makes an aggressive threshold pretty problematic. I'm wondering whether it might work with this paraphrase replacement, though.

Yes, I agree. I think will this will help.

It seems quite difficult to learn a representation for a word occurring only once in the training data. You also don't learn any representation for the UNK token that you'll be using over the dev data. So I think th=1 seems better than th=0? That would be my guess.

I'm not sure about this, maybe we can do some experiment on this. basically th=1 is we replace the random vector with the same UNK representation.

Here's how the paraphrased data looks at th=50. I need to clean things up a bit before I can give you a pull request. I would say that the current results look a little promising, but we can do better.

Yes, agree

It seems to me like crucial words of the question are often relatively rare in the data. The current paraphrase model often messes them up. But, I'm not sure how well you can train the model to learn them, on only a few examples.

Yeah, I also tried to do some experiment on paraphrase of Question, and we can discuss with this more if you are interested.

Jiasen

honnibal · 2016-03-08T22:08:06Z

I'm not sure about this, maybe we can do some experiment on this. basically th=1 is we replace the random vector with the same UNK representation.

I think the random vector seems potentially problematic. It could be like replacing the word with a random one from your vocabulary. You could get a vector that's close to or identical some common word. Maybe empirically it makes no difference. I'm always trying to replace experiments with intuition though :). I find I always have too many experiments to run, so I'm always trying to make these guesses.

I've made a pull request that gives you a spacy option on prepro.py, and another script to create the necessary data files. I know you're probably already running a lot of experiments, but I'd be interested to see th=5 with the paraphrasing, if you have time. I might have time to try a simple BoW experiment tomorrow, but I doubt I have time to set up Torch and run the full model.

honnibal · 2016-03-09T00:20:21Z

Example output at threshold 50 below. I expect much lower thresholds to perform better, but it's harder to see the paraphrasing working when relatively fewer tokens are replaced.

What kind of meals are there? [u'What', u'kind', u'of', u'meal', u'are', u'there', u'?']
Where is the arrow pointing? [u'Where', u'is', u'the', u'arrow', u'pointing', u'?']
Where is the rolling pin in the kitchen? [u'Where', u'is', u'the', u'roll', u'wire', u'in', u'the', u'kitchen', u'?']
How tall are the ceilings? [u'How', u'tall', u'are', u'the', u'ceiling', u'?']
Which hand has a mitt? [u'Which', u'hand', u'has', u'a', u'glove', u'?']
IS this woman wearing sneakers? [u'IS', u'this', u'woman', u'wearing', u'sneakers', u'?']
What is on the blond man's head? [u'What', u'is', u'on', u'the', u'blonde', u'man', u"'s", u'head', u'?']
Which front leg has more white? [u'Which', u'front', u'leg', u'has', u'more', u'white', u'?']
Bike or a car? [u'bike', u'or', u'a', u'car', u'?']
What is the dog watching over? [u'What', u'is', u'the', u'dog', u'watching', u'over', u'?']
58364 paraphrased/215375 (99.83% done)   
example processed tokens:
What is the table made of? [u'What', u'is', u'the', u'table', u'made', u'of', u'?']
Is the food napping on the table? [u'Is', u'the', u'food', u'asleep', u'on', u'the', u'table', u'?']
What has been upcycled to make lights? [u'What', u'has', u'been', u'sweater', u'to', u'make', u'lights', u'?']
Is this an Spanish town? [u'Is', u'this', u'an', u'English', u'town', u'?']
Are there shadows on the sidewalk? [u'Are', u'there', u'shadows', u'on', u'the', u'sidewalk', u'?']
What is in the top right corner? [u'What', u'is', u'in', u'the', u'top', u'right', u'corner', u'?']
Is it cold outside? [u'Is', u'it', u'cold', u'outside', u'?']
What is leaning against the house? [u'What', u'is', u'leaning', u'against', u'the', u'house', u'?']
How many windows can you see? [u'How', u'many', u'windows', u'can', u'you', u'see', u'?']
Is this in a park? [u'Is', u'this', u'in', u'a', u'park', u'?']

dexter1691 closed this as completed Mar 7, 2016

dexter1691 reopened this Mar 7, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ideas for NLP pre-processing and feature engineering #14

Ideas for NLP pre-processing and feature engineering #14

honnibal commented Mar 6, 2016

dhruvbatra commented Mar 7, 2016

honnibal commented Mar 7, 2016

dhruvbatra commented Mar 7, 2016

honnibal commented Mar 7, 2016

dhruvbatra commented Mar 7, 2016

honnibal commented Mar 7, 2016

dhruvbatra commented Mar 7, 2016

abhshkdz commented Mar 7, 2016

dexter1691 commented Mar 7, 2016

honnibal commented Mar 7, 2016

honnibal commented Mar 7, 2016

jiasenlu commented Mar 7, 2016

honnibal commented Mar 7, 2016

jiasenlu commented Mar 7, 2016

honnibal commented Mar 8, 2016

honnibal commented Mar 9, 2016

Ideas for NLP pre-processing and feature engineering #14

Ideas for NLP pre-processing and feature engineering #14

Comments

honnibal commented Mar 6, 2016

dhruvbatra commented Mar 7, 2016

honnibal commented Mar 7, 2016

dhruvbatra commented Mar 7, 2016

honnibal commented Mar 7, 2016

dhruvbatra commented Mar 7, 2016

honnibal commented Mar 7, 2016

dhruvbatra commented Mar 7, 2016

abhshkdz commented Mar 7, 2016

dexter1691 commented Mar 7, 2016

honnibal commented Mar 7, 2016

honnibal commented Mar 7, 2016

jiasenlu commented Mar 7, 2016

honnibal commented Mar 7, 2016

jiasenlu commented Mar 7, 2016

honnibal commented Mar 8, 2016

honnibal commented Mar 9, 2016