Sentiment analysis on New York Times immigration data with VADER and TextBlob dictionaries
Poster reference for the project: https://scholar.valpo.edu/cus/917/
- RStudio
- Jupyter Notebooks (Python)
Folder: 1-NYT API Data Extraction
- Created an R notebook.
- Extracted articles into a table-like structure following https://www.storybench.org/working-with-the-new-york-times-api-in-r/
- Extracted immigration articles with the query "migrant OR immigration OR immigrant OR migration OR refugee OR alien OR undocumented OR asylum" to cover immigration as broadly as possible.
- Raw data from 1981-2020 is in Exctracting NYT API (allNYTSearch1981to2020). Each extraction run started from the last date reached by the previous requests and continued until the API hit its maximum number of calls. This was done manually, and with the query specified in the notebook a run usually failed after about 2 years of data.
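The extraction itself was done in R, so the following is only a minimal Python sketch of how the requests could be organized. It assumes the public Article Search endpoint and its `q`, `begin_date`, `end_date`, `page` and `api-key` parameters. The two-year windows mirror the point at which the manual runs usually failed. The helper names and the `YOUR_API_KEY` placeholder are hypothetical.

```python
from datetime import date
from urllib.parse import urlencode

# Public Article Search endpoint (assumed; the project used it from R).
BASE_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

# Query string quoted from the notes above.
QUERY = ("migrant OR immigration OR immigrant OR migration OR refugee "
         "OR alien OR undocumented OR asylum")


def build_request_url(begin: date, end: date, page: int, api_key: str) -> str:
    """Build one Article Search request URL for a date window and page."""
    params = {
        "q": QUERY,
        "begin_date": begin.strftime("%Y%m%d"),
        "end_date": end.strftime("%Y%m%d"),
        "page": page,
        "api-key": api_key,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def two_year_windows(first_year: int, last_year: int):
    """Yield (begin, end) date pairs covering at most two calendar years each."""
    for start in range(first_year, last_year + 1, 2):
        stop = min(start + 1, last_year)
        yield date(start, 1, 1), date(stop, 12, 31)


windows = list(two_year_windows(1981, 2020))
# "YOUR_API_KEY" is a placeholder, not a real key.
url = build_request_url(*windows[0], page=0, api_key="YOUR_API_KEY")
```

Splitting the range into fixed windows makes it easy to resume from the last completed window once the call limit resets.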
Folder: 2-Standarization and Training
- Drop duplicate articles.
- Data standardization: normalize lead_paragraph with the preprocess_regex function.
- TextBlob and VADER each assign an initial polarity score between -1 and 1 to the first paragraph (lead_paragraph), using their default dictionaries.
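The steps above can be sketched roughly as follows. The body of `preprocess_regex` here is a hypothetical stand-in, since the project's actual regular expressions are not shown in these notes, and de-duplicating on `lead_paragraph` is likewise an assumption. The scoring function uses VADER's `compound` score and TextBlob's `sentiment.polarity`, both of which fall between -1 and 1.

```python
import re


def preprocess_regex(text: str) -> str:
    """Hypothetical stand-in for the project's preprocess_regex function."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"[^a-z\s']", " ", text)      # keep letters, spaces, apostrophes
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace


def drop_duplicates(articles: list[dict]) -> list[dict]:
    """Keep the first article seen for each lead_paragraph (assumed key)."""
    seen, unique = set(), []
    for article in articles:
        key = article["lead_paragraph"]
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique


def score_polarity(text: str) -> dict:
    """Score text with the default VADER and TextBlob dictionaries."""
    # Imported lazily so the helpers above work without these packages.
    from textblob import TextBlob
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    return {
        "vader": SentimentIntensityAnalyzer().polarity_scores(text)["compound"],
        "textblob": TextBlob(text).sentiment.polarity,
    }
```

One design point to keep in mind: VADER also reads capitalization and punctuation as emphasis, so aggressive normalization before scoring can change its output.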
Folder: 3-Recoding words by hand
- Identify problematic words and their n-gram counterparts, such as "united", most occurrences of which belong to "United States".
- Recode individual words into the correct sphere, using inter-rater reliability measures to decide the allocation.
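The notes do not say which reliability measure was used. As one common option, Cohen's kappa between two coders could be computed as in this sketch. The label names and the example codings are made up for illustration.

```python
from collections import Counter


def cohens_kappa(coder_a: list[str], coder_b: list[str]) -> float:
    """Cohen's kappa for two coders labeling the same list of words."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty word list")
    n = len(coder_a)
    # Share of words on which the two coders agree.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, from each coder's label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative codings only, not project data.
coder_a = ["positive", "positive", "negative", "negative"]
coder_b = ["positive", "negative", "negative", "negative"]
kappa = cohens_kappa(coder_a, coder_b)  # 0.5 for this example
```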
Folder: 4-Retrain and Filter
- Allocate the words identified in step 3 to the correct sphere in the VADER dictionary, since VADER was identified as the better dictionary.
- Filter Latino news using the query specified in the folder "latino query".
- Filter news that was written in the United States.
- Normalize scores from the range -1 to 1 to the range 0 to 100 (percent).
- Compute the yearly, quarterly and monthly means of the normalized scores.
- Plot time series of normalized sentiment scores for all articles, all Latino articles, USA articles and USA Latino articles.
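A minimal sketch of the retraining, normalization and aggregation steps above. It assumes retraining means overriding entries in VADER's lexicon, which the `vaderSentiment` package exposes as a plain dictionary on the analyzer. The helper names, the example override value and the sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def retrained_analyzer(overrides: dict):
    """Return a VADER analyzer whose lexicon includes hand-recoded words."""
    # Imported lazily so the helpers below work without vaderSentiment.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    analyzer.lexicon.update(overrides)  # e.g. {"united": 0.0}, a made-up value
    return analyzer


def normalize(score: float) -> float:
    """Map a polarity score from [-1, 1] to [0, 100]."""
    return (score + 1) / 2 * 100


def period_key(day: date, period: str) -> tuple:
    """Return the grouping key for a date under the given period."""
    if period == "yearly":
        return (day.year,)
    if period == "quarterly":
        return (day.year, (day.month - 1) // 3 + 1)
    if period == "monthly":
        return (day.year, day.month)
    raise ValueError(f"unknown period: {period}")


def period_means(rows, period: str) -> dict:
    """Average normalized scores per period; rows are (date, raw_score) pairs."""
    buckets = defaultdict(list)
    for day, score in rows:
        buckets[period_key(day, period)].append(normalize(score))
    return {key: mean(values) for key, values in sorted(buckets.items())}


# Sample rows for illustration only, not project data.
rows = [(date(2019, 1, 15), -0.5), (date(2019, 2, 1), 0.5), (date(2019, 7, 4), 1.0)]
quarterly = period_means(rows, "quarterly")
# quarterly == {(2019, 1): 50.0, (2019, 3): 100.0}
```

The same `period_means` call with `"yearly"` or `"monthly"` gives the other two series used for the time series plots.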
Folder: 5-Final Sentiment output
- Output retrained VADER scores for all the data (NYT_data_1980_to_2020_Retrained)
- Output articles that are just in the United States (US_News_Articles)
- Output articles that are Latino immigration news that appear in the United States (US_Latino_News_Articles)
Extra analysis for further research, building on the aggregated monthly, quarterly and yearly data
- Build a machine learning model on the retrained VADER data.
- Build a web application that scores new articles using the retrained data, and a service that ingests new articles via the NYT API.