Data preprocessing details #4

artidoro · 2022-04-06T21:27:57Z

Could you give more details on how you preprocess the data? I noticed underscore characters are present instead of some special characters, for example.

It would be ideal if you could share the code you used to preprocess the data. I am comparing another dataset to NELA and I need to apply the same preprocessing steps to make sure the discriminators don't pick up preprocessing differences between the datasets.

Thank you for your help!

BenjaminDHorne · 2022-04-06T22:35:40Z

Hi, There are no preprocessing steps. The data for each outlet comes directly from their RSS feed. If there are any artifacts, they come from the outlet itself, not the collection process. Ben

artidoro · 2022-04-07T19:35:53Z

Thank you for the information!

Following up, when you add the "@" signs, what tokenizer do you use?
Are you splitting on whitespace, or do you use NLTK or spaCy?

artidoro · 2022-04-07T20:01:28Z

It seems like the text you provide is tokenized, for example: "Here 's", and "it ’ s" have spaces. Also, there are spaces between words and punctuation which are not stylistically common. Do you have any hunch on how these things came to be?

mgruppi · 2022-04-07T21:17:23Z

Hi @artidoro. After looking at your questions I believe there might be some points that need clarification.
Yes, you are right: there is some tokenization happening, which we need in order to replace words with @. We apply NLTK's word_tokenize to the raw input text. Hope this clarifies it!
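
For anyone who wants to apply comparable processing to another dataset, here is a minimal sketch of what that step might look like. It is not the NELA code. The thread only confirms that NLTK's word_tokenize is applied to the raw text and that some words are replaced with @. The function names, the `masked_words` argument, the case-insensitive match, and re-joining tokens with single spaces are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List


def nltk_tokenize(text: str) -> List[str]:
    """Tokenize with NLTK's word_tokenize (needs nltk plus its Punkt data)."""
    # Imported lazily so the rest of the sketch runs without nltk installed.
    from nltk.tokenize import word_tokenize
    return word_tokenize(text)


def tokenize_and_mask(
    text: str,
    masked_words: Iterable[str],
    tokenize: Callable[[str], List[str]] = nltk_tokenize,
) -> str:
    """Tokenize text, replace selected tokens with "@", re-join with spaces.

    `masked_words` is hypothetical: the thread does not say which words
    are replaced, so the caller has to supply the set.
    """
    masked = {word.lower() for word in masked_words}
    tokens = tokenize(text)
    # Assumption: matching is case-insensitive and tokens are joined by one space.
    return " ".join("@" if tok.lower() in masked else tok for tok in tokens)
```

Joining tokenizer output with spaces is what would produce strings like "Here 's" and the detached punctuation noted earlier in the thread. Passing a different `tokenize` callable (for example `str.split`) makes it easy to compare whitespace splitting against NLTK on the same text.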
