shophine-p0

Word Count

Sub-project 1:

Pick the 40 words with the highest total count across all documents.

Drop words with total count less than 2 across all documents
Words should be case-insensitive
Store the output in sp1.json file
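The steps above can be sketched in plain Python as follows. This is an illustration under assumptions, not the notebook's Spark code: the function name `top_words`, splitting on whitespace, the alphabetical tie-break, and the `{word: count}` output layout are all hypothetical choices.

```python
import json
from collections import Counter

def top_words(documents, limit=40, min_count=2):
    """Return the `limit` most frequent words as a {word: count} dict."""
    counts = Counter()
    for text in documents:
        # Lower-case before splitting so "The" and "the" are counted together.
        counts.update(text.lower().split())
    # Drop words whose total count across all documents is below min_count.
    frequent = [(w, c) for w, c in counts.items() if c >= min_count]
    # Sort by count (descending); ties are broken alphabetically (an assumption).
    ranked = sorted(frequent, key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:limit])

# Tiny made-up input, only for illustration.
docs = ["The cat and the hat", "the Cat sat"]
result = top_words(docs)
# The serialized form that could be written to sp1.json (layout is an assumption).
payload = json.dumps(result)
```

In Spark the same pipeline would typically map onto RDD operations such as `flatMap`, `reduceByKey`, and `takeOrdered`.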

Sub-project 2:

Do the same as Sub-project 1, and also filter out the words listed in the stopword.txt file.

Words in stopword.txt must be dropped
Words should be case-insensitive
Store the output in sp2.json file
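A minimal plain-Python sketch of the stopword filter, assuming stopword.txt holds one word per line. The helper names `load_stopwords` and `top_words_without_stopwords` are hypothetical, and the stopwords are lower-cased so the comparison stays case-insensitive.

```python
from collections import Counter

def load_stopwords(lines):
    """Build a lower-cased stopword set from an iterable of lines."""
    return {line.strip().lower() for line in lines if line.strip()}

def top_words_without_stopwords(documents, stopwords, limit=40, min_count=2):
    counts = Counter()
    for text in documents:
        # Count only the lower-cased words that are not stopwords.
        counts.update(w for w in text.lower().split() if w not in stopwords)
    ranked = sorted(
        ((w, c) for w, c in counts.items() if c >= min_count),
        key=lambda kv: (-kv[1], kv[0]),
    )
    return dict(ranked[:limit])

# In practice the lines would come from open("stopword.txt"); inlined here.
stopwords = load_stopwords(["the\n", "And\n", "\n"])
docs = ["The cat and the hat", "the Cat sat on the hat"]
result = top_words_without_stopwords(docs, stopwords)
```

With Spark, a small stopword set like this is often shared with the workers as a broadcast variable, though the sketch above does not depend on that.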

Sub-project 3:

Strip leading and trailing punctuation from each word so that neither its first nor its last character is a punctuation mark.

Punctuation characters to strip: .,:;'!?
Discard one-character words, then trim punctuation from the remaining words
Words in stopword.txt must be dropped
Words should be case-insensitive
Pick top 40 words
Store the output in sp3.json file
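The cleaning rules above could look like this in plain Python. It follows the order as listed (discard one-character tokens, then trim), which is an assumption: trimming first would also discard tokens such as "a." that shrink to one character. Dropping tokens that become empty after trimming, and checking stopwords after trimming, are likewise assumptions, and `clean_tokens` is a hypothetical name.

```python
PUNCTUATION = ".,:;'!?"

def clean_tokens(text, stopwords=frozenset()):
    """Lower-case, split on whitespace, and apply the cleaning rules."""
    cleaned = []
    for token in text.lower().split():
        # Discard one-character tokens before trimming (order as listed).
        if len(token) <= 1:
            continue
        # str.strip removes the listed characters from both ends only,
        # so interior punctuation such as the apostrophe in "don't" stays.
        word = token.strip(PUNCTUATION)
        # Skip tokens that were pure punctuation, and skip stopwords.
        if word and word not in stopwords:
            cleaned.append(word)
    return cleaned

words = clean_tokens("Hello, world! 'Tis a test; I say... don't", {"say"})
```

The cleaned words can then be counted and ranked exactly as in the earlier sub-projects.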

Sub-project 4:

Calculate TF-IDF values for every word and pick the top 5 words from each document. Strip leading and trailing punctuation from each word so that neither its first nor its last character is in the punctuation list.

Words should be case-insensitive
Words in stopword.txt must be dropped
Discard one-character words
Strip out leading or trailing punctuation
Compute TF-IDF values
Output should have 5 * N entries, where N is the number of documents
Store the output in sp4.json file
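A small plain-Python sketch of the TF-IDF step. The text does not state which TF-IDF variant is used, so this assumes raw term count for TF and the natural log of N / df for IDF. The function name `top_tfidf`, the input layout, and the alphabetical tie-break are hypothetical.

```python
import math
from collections import Counter

def top_tfidf(documents, k=5):
    """documents: {doc_id: list of cleaned words}.
    Returns {doc_id: [(word, score), ...]} with at most k entries each."""
    n_docs = len(documents)
    # Document frequency: the number of documents that contain each word.
    df = Counter()
    for words in documents.values():
        df.update(set(words))
    result = {}
    for doc_id, words in documents.items():
        tf = Counter(words)  # raw term counts within this document
        scores = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
        # Highest score first; ties broken alphabetically (an assumption).
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result[doc_id] = ranked[:k]
    return result

# Made-up documents, already cleaned, only for illustration.
docs = {
    "a": ["spark", "spark", "rdd", "data"],
    "b": ["conda", "data", "data"],
}
top = top_tfidf(docs, k=2)
```

A word that appears in every document gets a score of zero under this IDF definition, and a document with fewer than k distinct words contributes fewer than k entries, so the 5 * N total only holds when every document has at least five distinct words.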

Installation

Apache Spark
Conda
Jupyter Notebook is a popular application that lets you edit, run, and share Python code in a web view. It allows you to modify and re-execute parts of your code in a very flexible way.

Running the application

Activate the conda environment
```
conda activate
```
Start Spark
```
pyspark
```

Dataset

The dataset is available in the data directory at the root of the project.

Authors

Shophine Sivaraja, Grad Student at the University of Georgia. For more information, please see CONTRIBUTORS.md
