A Python/NLP project that analyzes headlines from American news sites and how people tweet about those headlines

AvrahamKahan123/headlinesAnalyzer

headlinesAnalyzer

Currently incomplete (~75% done). A news and Twitter analyzer built with Python natural language processing modules and Elasticsearch. It attempts to build, over time, a tracker for topics: how many articles there are about each topic, and how people are tweeting about it.

explanation

A flow chart for the program can be found in flow_chart.pdf. See overview.txt for a full module-by-module explanation of the program structure and the design decisions behind it.

This program attempts to understand the interaction between the news, news sites, and social media. First, it scrapes several news sites using BeautifulSoup and saves the results to a PostgreSQL database. The articles are then parsed and analyzed with the help of the spaCy machine learning module and pre-created Postgres tables to extract related names, places, and organizations. Next, the program extracts topics using the Latent Dirichlet Allocation (LDA) algorithm from scikit-learn, and the topics are indexed by Elasticsearch. Article headlines are then mapped to topics by searching them against the topics index. Elasticsearch is used for this, rather than just reading off how LDA constructed its topics, because Elasticsearch is better at providing similarity scores and at determining whether a document truly belongs to a topic cluster.

At this point the Twitter API is used to generate streams of tweets filtered on keywords from the topics (e.g. "BLM", "Police", and "Portland" could be keywords for one topic). These tweets are then sentiment-analyzed (i.e. classified as positive or negative) with the help of the TextBlob module, and the results are stored in PostgreSQL for easy retrieval whenever desired. New topics are generated from scratch every set time interval, and during that interval new headlines can be classified by searching their titles against the topics index.

The program makes use of a great deal of data to parse the headlines for names and places; this data is stored mostly in PostgreSQL tables. It is necessary because testing showed that machine learning modules are usually not very good at telling the difference between last names, organizations, and places (e.g. in one test, "Biden" was characterized as a 'geopolitical entity'). The program automatically adds values and names to the database by scraping the web (this feature is already complete for names). The program also indexes article titles so as to make them searchable. Significant modification to the schema has been done recently, so some of the code may currently be out of step with it.

current state

Most of the code for each individual task (parsing the headlines, extracting topics with LDA, indexing the topics, searching headlines against the topics, fetching tweets with the Twitter API, and extracting places, people, and organizations from the headlines) is complete, but the components are not yet linked together into the full pipeline. Extracting proper nouns from the headlines still lacks some functionality (parsing abbreviations, specifically). Basic unit tests have verified some components. The current focus is on adding unit and integration tests to assure correct functionality and on adding more data to the database.
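The real pipeline uses Elasticsearch to score headlines against the topics index; as a stdlib-only illustration of what that matching step accomplishes, the toy overlap score below assigns a headline to its best-matching topic cluster, or to none if the score is too low. The topic names, keyword sets, and threshold are all assumptions:

```python
# Toy stand-in for the "search headlines against topics" step.
# Elasticsearch provides real similarity scoring in the pipeline;
# this only demonstrates the idea of topic assignment with a cutoff.
topics = {
    "protests": {"blm", "police", "portland", "protest"},
    "stimulus": {"senate", "stimulus", "economy", "bill"},
}

def match_topic(headline, topics, threshold=0.2):
    """Assign a headline to the topic with the highest keyword overlap,
    or return None if no topic clears the threshold."""
    words = set(headline.lower().split())
    best_topic, best_score = None, 0.0
    for name, keywords in topics.items():
        score = len(words & keywords) / len(keywords)
        if score > best_score:
            best_topic, best_score = name, score
    return best_topic if best_score >= threshold else None

print(match_topic("Portland police break up protest downtown", topics))  # protests
print(match_topic("Local bakery wins baking contest", topics))           # None
```

The threshold is the key design point: as the README notes, scoring lets the program decide whether a headline truly belongs to a topic cluster rather than forcing every headline into some topic.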

current issues

Finding sources of data for proper nouns is a big problem. Multithreading will be added, especially for the DB queries used when searching for proper nouns. spaCy was having some issues, but it has been working fine since being reinstalled.
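The planned multithreading for proper-noun DB queries could be sketched with a thread pool as below. Here lookup_proper_noun is a hypothetical stand-in for the real PostgreSQL-backed query and only consults an in-memory table:

```python
# Hedged sketch of parallelizing proper-noun lookups with a thread pool.
# KNOWN_NAMES and lookup_proper_noun are placeholders for the real
# PostgreSQL tables and queries described in the README.
from concurrent.futures import ThreadPoolExecutor

KNOWN_NAMES = {"biden": "person", "portland": "place", "fbi": "organization"}

def lookup_proper_noun(token):
    """Placeholder for a DB query classifying a proper noun."""
    return token, KNOWN_NAMES.get(token.lower(), "unknown")

tokens = ["Biden", "Portland", "FBI", "Zyxxo"]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(lookup_proper_noun, tokens))

print(results)
```

Threads suit this workload because the per-token lookups are I/O-bound database round trips, so they can overlap while waiting on the server.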
