Use Sarcasm v2 Dataset #1

Open
mubaris opened this issue Sep 28, 2017 · 10 comments


@mubaris
Owner

mubaris commented Sep 28, 2017

Sarcasm v2 is a better dataset for this project, since it has both the parent comment and the reply. Use this dataset to improve the predictions.

@cagdasgerede

v2 is a single CSV file. Should I write a Python function to convert that file into the format sklearn.datasets.load_files expects?

For example, take the following data points:

Corpus,Label,ID,Quote
GEN,sarc,GEN_sarc_0000,First off, That's grade A USDA approved Liberalism in a nutshell.
GEN,notsarc,GEN_notsarc_1136,First

Programmatically:

  1. Create a file GEN_sarc_0000.txt containing "First off, That's grade A USDA approved Liberalism in a nutshell." and a file GEN_notsarc_1136.txt containing "First".
  2. Put the files into the container/sarc and container/notsarc folders, respectively.

This way, the current data loading can work as is.
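A minimal sketch of the conversion described above, assuming the header shown in the sample (Corpus, Label, ID, Quote) and that quotes containing commas are properly quoted in the real CSV. The function name `csv_to_label_folders` and its parameters are hypothetical, not part of the project.

```python
import csv
from pathlib import Path


def csv_to_label_folders(csv_path, container):
    """Write each row's Quote to <container>/<Label>/<ID>.txt; return the file count."""
    container = Path(container)
    written = 0
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # DictReader parses quoted fields, so commas inside a quote are safe
        # as long as the CSV quotes them (an assumption about the v2 file).
        for row in csv.DictReader(handle):
            label_dir = container / row["Label"]  # e.g. container/sarc
            label_dir.mkdir(parents=True, exist_ok=True)
            target = label_dir / f"{row['ID']}.txt"
            target.write_text(row["Quote"], encoding="utf-8")
            written += 1
    return written
```

The resulting layout is one subfolder per label with one text file per sample, which is the layout that folder-based loaders typically expect.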

What do you think about this approach?

@mubaris
Owner Author

mubaris commented Oct 1, 2017

The v2 dataset has the columns Quote and Reply. That's why it's better than v1. With both the parent comment and the reply, I think our bot will be more accurate.
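A rough sketch of reading those paired columns, assuming the CSV header uses exactly the names Quote, Reply, and Label (the header in the released file may differ). `Example` and `load_pairs` are hypothetical names used only for illustration.

```python
import csv
from typing import List, NamedTuple


class Example(NamedTuple):
    quote: str  # parent comment
    reply: str  # reply to the parent comment
    label: str  # e.g. "sarc" or "notsarc"


def load_pairs(csv_path: str) -> List[Example]:
    """Read (quote, reply, label) triples; column names are assumptions."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return [
            Example(row["Quote"], row["Reply"], row["Label"])
            for row in csv.DictReader(handle)
        ]
```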

Please don't go with the method you proposed.

@cagdasgerede

It sounds like you are describing a more substantial change. What are the steps to achieve what you propose? Since you labeled this as hacktoberfest, could you provide some more direction?

@cromagnonninja

Can I work on this issue? What exactly are the problems or concerns regarding this issue at the moment?

@mubaris
Owner Author

mubaris commented Dec 8, 2017

@Bhanu1911

Current method - We generate features from a single text field to train the models.

Desired method - The v2 dataset provides two text fields: a question and the reply to it. We want to build new models based on these two inputs.

Hope this helps
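One possible way to generate features from two text fields instead of one is to tag each token with the field it came from, so a model can tell "great" in the parent apart from "great" in the reply. This is only a sketch under that assumption, not the project's feature extraction; `pair_features` and the token pattern are hypothetical.

```python
import re
from collections import Counter

# Simple lowercase word tokenizer (an assumption; the project may tokenize differently).
TOKEN = re.compile(r"[a-z0-9']+")


def pair_features(quote, reply):
    """Count tokens separately per field, using keys like 'quote=word'."""
    features = Counter()
    for field, text in (("quote", quote), ("reply", reply)):
        for token in TOKEN.findall(text.lower()):
            features[f"{field}={token}"] += 1
    return features
```

A mapping like this can be passed to any classifier that accepts feature dictionaries, for example after vectorizing it with scikit-learn's DictVectorizer.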

@cromagnonninja

Basically, this means we have to start from the ground up: we now have to train a model for the replies too, if I'm not wrong? (I'll study the code and see how you trained it the first time around.) Plan of action:

  1. Split the CSV file into two parts, quote and reply.
  2. Train and test on both parts after the split.
  3. Configure the bot to send only those replies that get a reasonably high score from all algorithms.

I believe that'll be the way to go?
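Step 3 could be sketched as a simple agreement check, assuming each algorithm reports a sarcasm probability between 0 and 1. The function name `should_send` and the 0.8 threshold are hypothetical placeholders, not values taken from the project.

```python
def should_send(sarcasm_probabilities, threshold=0.8):
    """Return True only if every model's sarcasm probability meets the threshold."""
    scores = list(sarcasm_probabilities)
    # With no model output at all, stay silent rather than reply.
    return bool(scores) and all(score >= threshold for score in scores)
```

Requiring agreement from all models trades recall for precision: the bot replies less often, but is less likely to flag a sincere comment.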

@cromagnonninja

Could you guide me as to how you created the dataset?

@mubaris
Owner Author

mubaris commented Dec 8, 2017

@Bhanu1911 What I was thinking is a little different.

  • Train the model with two inputs: quote and reply.
  • To decide whether a comment on Reddit is sarcastic, we consider the comment (reply) and its parent comment (quote).

This makes sense because sarcasm is context-based. Having a comment and its parent comment will be more accurate than a single comment alone.

@mubaris
Owner Author

mubaris commented Dec 8, 2017

I think the source gives enough background about how they created the dataset - Sarcasm v2

@cromagnonninja

I meant how did you partition the dataset?


3 participants