Corpus-Clustering-Project

Various text analyses applied to science-related texts in the COCA corpus

This project analyzes science-related texts in COCA (the Corpus of Contemporary American English). Currently, the magazine, academic, and news sections of the corpus are used.
The data pipeline can be briefly described as follows:

Preprocessing

Select the text IDs that are related to science
Done by sorting the Excel sheet that contains the text ID numbers and their specifications
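
The selection step above could be sketched roughly as follows. This is only an illustration under stated assumptions: the column names (`text_id`, `section`, `subgenre`), the sample rows, and the keyword match are hypothetical, since this README does not describe the actual layout of the Excel sheet. In practice the rows would be loaded from the sheet (for example with `pandas.read_excel`) rather than written inline.

```python
# Hypothetical rows standing in for the COCA metadata sheet.
# The real column names and values are not given in this README.
rows = [
    {"text_id": 2003, "section": "MAG", "subgenre": "Sci/Tech"},
    {"text_id": 1001, "section": "ACAD", "subgenre": "Science"},
    {"text_id": 3002, "section": "NEWS", "subgenre": "Sports"},
    {"text_id": 1005, "section": "ACAD", "subgenre": "History"},
    {"text_id": 3001, "section": "NEWS", "subgenre": "Sci/Tech"},
]

# Assumed markers of a science-related specification (illustrative only).
SCIENCE_KEYWORDS = ("sci", "tech")


def is_science(row):
    """Return True if the row's subgenre label contains a science keyword."""
    label = row["subgenre"].lower()
    return any(keyword in label for keyword in SCIENCE_KEYWORDS)


def science_text_ids(rows):
    """Keep science-related rows, sort them by section and ID, return the IDs."""
    selected = [row for row in rows if is_science(row)]
    selected.sort(key=lambda row: (row["section"], row["text_id"]))
    return [row["text_id"] for row in selected]


science_ids = science_text_ids(rows)
print(science_ids)
```

The resulting list of IDs can then be used to pull the matching texts out of each corpus section.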

Folders and files:
.ipynb_checkpoints
__pycache__
Data Pre-processing - Academic section.ipynb
Data Pre-processing - Magazine section.ipynb
Data Pre-processing - Newspaper section.ipynb
Doc2Vec.ipynb
Filtering science-related texts from docs.ipynb
Preprocess & Parse.ipynb
README.md
T-SNE clustering.ipynb
TF-IDF.ipynb
Word Vector Clustering (DBSCAN).ipynb
Word Vector Clustering (K-Means).ipynb
Word Vector Clustering (Ward).ipynb
Word2Vec(NOUNS only).ipynb
Word2Vec.ipynb
news_sci_text_excel.pkl
preprocess_and_parse.py
wordcloud_function
wordcloud_function.py

PolarBear77/Corpus-Clustering-Project