State of the Union Analysis with PySpark

Authored by:
Ashutosh Pathak and Chaithra Bekal

Overview

This PySpark program analyzes State of the Union (SOTU) addresses from 2009 to 2021 to extract insights such as word frequencies, interesting words, frequent word pairs, and the lift of word pairs. It preprocesses the text, runs the analysis, and saves the results to text files.

Dependencies

PySpark: The program requires Apache Spark with PySpark to process large-scale data efficiently.
HDFS: Hadoop Distributed File System is used to read the input text file and save the results.

Usage

Ensure you have Apache Spark with PySpark installed and configured properly.
Place the SOTU text file (e.g., stateoftheunion1790-2021.txt) in your HDFS directory or modify the file path in the code accordingly.
Update the stopwords list in the code if needed. The provided list is a basic set of common English stopwords.
Run the program using the following command:
```
spark-submit state_of_the_union_analysis.py
```

The program will perform the following steps:

Read the SOTU text file.
Preprocess the text, including cleaning and extracting years.
Analyze word frequencies, interesting word pairs, and word lift.
Save the results to text files in HDFS.
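
The preprocessing and word-frequency steps above can be sketched in plain Python as follows. This is a minimal illustration, not the program's actual code: the function names `tokenize` and `word_frequencies`, the tiny `STOPWORDS` set, and the tokenizing regex are all assumptions made for the example. In a Spark job, the same per-line logic would typically be passed to RDD transformations such as `flatMap` and `reduceByKey`.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword set; the real program ships its own list.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "our", "we"}

def tokenize(line):
    """Lowercase a line, keep alphabetic tokens, and drop stopwords."""
    words = re.findall(r"[a-z]+", line.lower())
    return [w for w in words if w not in STOPWORDS]

def word_frequencies(lines):
    """Count how often each remaining word occurs across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts

# Made-up sample lines, used only to exercise the functions.
sample = [
    "The state of our union is strong.",
    "We will rebuild the economy and the union.",
]
freqs = word_frequencies(sample)
print(freqs.most_common(3))
```

Keeping the tokenizer as a pure function makes it easy to test locally before running it on the full file in HDFS.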

After the program completes, you can access the following result files:

interesting_words_sample.txt: Words that appear with high frequency in the years following each window.
pairs_sample.txt: Word pairs that appear together in the same sentence more than 10 times.
high_lift_pairs.txt: Word pairs with lift values > 3.0.
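
The pair and lift thresholds above can be illustrated with a short plain-Python sketch. It assumes the standard association-rule definition of lift, lift(A, B) = P(A and B) / (P(A) * P(B)), with each probability estimated as the fraction of sentences containing the word or pair; the README does not state the exact formula the program uses. The function name `pair_lifts` and its parameters are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pair_lifts(sentences, min_pair_count=10, min_lift=3.0):
    """Return {(a, b): lift} for word pairs passing both thresholds.

    sentences: list of token lists, one list per sentence.
    """
    n = len(sentences)
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        unique = sorted(set(tokens))  # count each word once per sentence
        word_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    result = {}
    for (a, b), together in pair_counts.items():
        if together <= min_pair_count:
            continue  # keep only pairs seen together more than the minimum
        # Assumed definition: P(a, b) / (P(a) * P(b)), estimated per sentence.
        lift = (together / n) / ((word_counts[a] / n) * (word_counts[b] / n))
        if lift > min_lift:
            result[(a, b)] = lift
    return result

# Toy data; the pair-count threshold is lowered because the sample is tiny.
toy = [
    ["health", "care"], ["health", "care"],
    ["jobs"], ["jobs"], ["jobs"],
    ["economy"], ["economy"], ["economy"],
]
print(pair_lifts(toy, min_pair_count=1, min_lift=3.0))
```

A lift above 1 means the two words co-occur more often than they would if they appeared independently, so a cutoff of 3.0 keeps only strongly associated pairs.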

Results

The program was tested on the entire SOTU text file, downloaded directly from the website. The results of the first two tasks are in the interesting_words_sample.txt and pairs_sample.txt folders, respectively. The results of the last task, also run on the entire text, are in the high_lift_pairs.txt folder.
