Various text analyses applied to science-related texts in the COCA corpus
This project is aimed at analyzing science-related texts in the COCA corpus (Corpus of Contemporary American English).
Currently, the magazine, academic, and news sections of the corpus are used.
The data pipeline can be briefly described as follows:
- Preprocessing
- sort text ids that are related to science
- done by sorting the Excel sheet containing text id numbers and specifications
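
The selection of science-related text ids could look roughly like the sketch below. It is not the project's actual script: the column names (`text_id`, `specification`), the keyword list, and the sample rows are all hypothetical, and it reads "sort" as picking rows whose specification matches a science keyword, then returning the ids in ascending order. Rows are plain dicts so the example runs without extra dependencies.

```python
from typing import Iterable

# Hypothetical column names and keywords; adjust them to the actual sheet.
ID_COLUMN = "text_id"
SPEC_COLUMN = "specification"
SCIENCE_KEYWORDS = ("sci", "tech")


def science_text_ids(rows: Iterable[dict]) -> list[int]:
    """Return sorted text ids whose specification mentions a science keyword."""
    ids = set()
    for row in rows:
        # Case-insensitive substring match on the specification column.
        spec = str(row.get(SPEC_COLUMN, "")).lower()
        if any(keyword in spec for keyword in SCIENCE_KEYWORDS):
            ids.add(int(row[ID_COLUMN]))
    return sorted(ids)


# Toy rows standing in for the sheet; these are not real COCA metadata.
sample_rows = [
    {"text_id": 3, "specification": "Sci/Tech"},
    {"text_id": 1, "specification": "Sports"},
    {"text_id": 2, "specification": "Science"},
]
science_ids = science_text_ids(sample_rows)
print(science_ids)  # [2, 3]
```

With a real sheet, the rows could be loaded with something like `pandas.read_excel(path).to_dict("records")`, assuming pandas and an Excel engine are installed. A plain substring match is crude (it would also catch "political science", for example), so the keyword list likely needs tuning against the real specifications.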