Data preprocessing details #4

Open
artidoro opened this issue Apr 6, 2022 · 4 comments

Comments

@artidoro

artidoro commented Apr 6, 2022

Could you give more details on how you preprocess the data? I noticed underscore characters are present instead of some special characters, for example.

It would be ideal if you could share the code you used to preprocess the data. I am comparing another dataset to NELA and I need to apply the same preprocessing steps to make sure the discriminators don't pick up preprocessing differences between the datasets.

Thank you for your help!

@BenjaminDHorne
Member

BenjaminDHorne commented Apr 6, 2022 via email

@artidoro
Author

artidoro commented Apr 7, 2022

Thank you for the information!

As a follow-up: when you add the "@" signs, which tokenizer do you use?
Are you splitting on whitespace, or do you use NLTK or spaCy?

@artidoro
Author

artidoro commented Apr 7, 2022

It seems the text you provide is already tokenized: for example, "Here 's" and "it ’ s" contain spaces. There are also spaces between words and punctuation, which is stylistically unusual. Do you have any idea how these came about?

@mgruppi
Member

mgruppi commented Apr 7, 2022

Hi @artidoro. After looking at your questions, I believe some points need clarification.
Yes, you are right: the text is tokenized, because we need tokens in order to replace words with @. We apply NLTK's word_tokenize to the raw input text. Hope this clarifies it!
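
For anyone trying to match this preprocessing on another dataset, here is a minimal sketch of why tokenizing and then re-joining tokens with spaces yields strings like "Here 's" and "it ’ s". It is an illustration only, not the maintainers' code. The regex tokenizer is a stdlib stand-in written to reproduce the examples quoted in this thread; it is not NLTK's word_tokenize and will differ from it on other inputs. The function names, the set of words to mask, and the one "@" per token rule are all hypothetical, since the thread does not say which words are replaced.

```python
import re

# Stand-in tokenizer (assumption): NOT NLTK's word_tokenize.
# It splits off straight-apostrophe clitics ('s), runs of word characters,
# and every other non-space character as its own token.
TOKEN_RE = re.compile(r"'\w+|\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word, clitic, and punctuation tokens."""
    return TOKEN_RE.findall(text)


def mask_tokens(text, words_to_mask, mask="@"):
    """Tokenize, replace selected tokens with `mask`, and re-join with spaces.

    `words_to_mask` is a hypothetical set of lowercase words; the real
    selection rule is not described in this thread.
    """
    tokens = simple_tokenize(text)
    masked = [mask if tok.lower() in words_to_mask else tok for tok in tokens]
    # Joining with single spaces is what leaves gaps before punctuation.
    return " ".join(masked)


if __name__ == "__main__":
    print(mask_tokens("Here's the report.", {"report"}))
    print(mask_tokens("it’s late", set()))
```

To apply the same step to a second dataset, the safer route is to run the real tokenizer (NLTK's word_tokenize, as stated above) over that dataset's raw text and join the tokens the same way, rather than relying on this approximation.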
