Corpus-Clustering-Project

Various text analyses applied to science-related texts in the COCA corpus

This project aims to analyze science-related texts in the COCA corpus (Corpus of Contemporary American English). Currently, the magazine, academic, and news sections of the corpus are used.
The data pipeline can be briefly described as follows:

  1. Preprocessing
  • identify the text IDs that are related to science
  • done by sorting the Excel sheet that lists text ID numbers and their specifications
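
The ID selection described in this step could be scripted along the following lines. This is only a minimal sketch, not the project's actual code: it assumes the Excel sheet has been exported to CSV, and the column names `text_id`, `genre`, and `domain`, the sample rows, and the helper `science_text_ids` are all hypothetical placeholders for illustration.

```python
import csv
import io


def science_text_ids(rows, keyword="science"):
    """Return the sorted text IDs whose 'domain' field mentions the keyword.

    The 'text_id' and 'domain' column names are assumptions; adjust them
    to match the real sheet.
    """
    ids = []
    for row in rows:
        if keyword in row["domain"].lower():
            ids.append(int(row["text_id"]))
    return sorted(ids)


# Made-up sample standing in for the exported sheet (not real COCA data).
SAMPLE = """text_id,genre,domain
4001,MAG,Science/Technology
4002,NEWS,Sports
3001,ACAD,Science
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(science_text_ids(rows))
```

To run this on a real export, the `io.StringIO(SAMPLE)` argument would be replaced by an opened CSV file; filtering by keyword rather than sorting by hand keeps the selection reproducible.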