1. Best practices for corpus building
- What is a corpus?
- What kinds of corpora are there?
- What kinds of searches can you do with a corpus?
- Do I need to build a corpus or can I use an existing one?
- How big does my corpus need to be?
- If I decide to build a corpus, where do I start?
- Additional resources
- Video presentation
A corpus (plural: corpora) is a principled collection of texts (e.g., news language, academic research articles, conversations) that are stored electronically. It isn’t just a random collection: it is shaped to represent the type of language you want to explore, so that it can be used to answer questions about language. Put simply, a corpus is a collection of language used to describe some aspect of language use or to answer a question about it. For example, we can find out:
- Which verbs and tenses are most common when presenting an argument vs. telling a story
- Whether the use of adjectives varies across genders
- Which transition words are most common in different genres
There are two types of corpora:
- General corpora that represent language use broadly (e.g., COCA)
- Specific corpora that represent a particular slice of language use (e.g., Crow)
There are also two main ways to access corpora:
- Online/web-based interfaces (Crow; MICUSP; COCA), where the corpus is hosted online and explored with tools built into the website
- Offline corpora, which you store on your own computer and explore with corpus software or with your own scripts
Corpora can include both spoken and written language.
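For the offline route, where you explore a corpus with your own scripts, the first step is usually just reading the files in. The following is a minimal sketch, assuming the corpus is a folder of UTF-8 plain-text files with a `.txt` extension. The function names `load_corpus` and `count_tokens` are made up for illustration, and splitting on whitespace is only a rough stand-in for proper tokenization.

```python
from pathlib import Path


def load_corpus(folder):
    """Read every .txt file in a folder into a {filename: text} dict."""
    corpus = {}
    for path in sorted(Path(folder).glob("*.txt")):
        corpus[path.name] = path.read_text(encoding="utf-8")
    return corpus


def count_tokens(corpus):
    """Rough size estimate: whitespace-separated tokens per file."""
    return {name: len(text.split()) for name, text in corpus.items()}
```

Calling `count_tokens(load_corpus("my_corpus"))` on a hypothetical folder named `my_corpus` would give a per-file token count, which is a quick way to check how large the collection actually is.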
There are many ways that we can search a corpus to find answers to research questions or to develop teaching materials. Here are four common types of corpus searches:
- Frequency lists with different ways to sort: to see most frequent words
- Keyword(s) in context (KWIC): to see the company a word or words keep
- Wildcard searches: to see different forms of a word
- N-grams or clusters/groups of words: to see groups of words that go together
Below we show examples of each of these four types of searches with screenshots from the freeware program AntConc.
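If you prefer scripts to a graphical tool, the same four search types can be approximated in a few lines of Python. This is a simplified sketch, not a description of how AntConc works internally: the sample text is invented, the tokenizer is deliberately naive (lowercase letters and apostrophes only), and the function names are hypothetical.

```python
import re
from collections import Counter

# Toy text standing in for a corpus (invented for illustration).
TEXT = (
    "The students write a report. The student writes a draft. "
    "Writing a report takes time, and the students wrote a report again."
)


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(tokens, top=5):
    """Frequency list: the most frequent words, highest count first."""
    return Counter(tokens).most_common(top)


def kwic(tokens, keyword, window=3):
    """Keyword in context: each hit with `window` tokens on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}")
    return lines


def wildcard(tokens, pattern):
    """Wildcard search: `*` matches any run of characters, e.g. 'writ*'."""
    regex = re.compile(re.escape(pattern).replace(r"\*", ".*"))
    return Counter(t for t in tokens if regex.fullmatch(t))


def ngrams(tokens, n=2):
    """N-grams: count every sequence of n consecutive tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))
```

With `tokens = tokenize(TEXT)`, a call such as `wildcard(tokens, "writ*")` groups the forms "write", "writes", and "writing", while `ngrams(tokens, 2).most_common(3)` lists the most frequent two-word sequences. For real research, a dedicated corpus tool or a linguistically informed tokenizer will handle punctuation, case, and multiword units more carefully than this sketch does.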
It’s always best to use an existing corpus if it represents the language you are interested in exploring. Here are some corpora that might be of use for you:
- British Academic Written English Corpus (BAWE)
- British National Corpus (BNC)
- Corpus of Contemporary American English (COCA)
- Corpus and Repository of Writing (Crow)
- International Corpus of Learner English (ICLE)
- Louvain International Database of Spoken English Interlanguage (LINDSEI)
- Michigan Corpus of Academic Spoken English (MICASE)
- Michigan Corpus of Upper-Level Student Papers (MICUSP)
The size of the corpus you need depends on the types of questions you are trying to answer.
A small corpus can be great for classroom research, to answer questions such as:
- What words will students need to be familiar with in order to read biology or engineering texts?
- How are my students using transitions or time markers in their writing?
A large corpus is essential if you want to capture all the variation across different aspects of language use, to answer questions such as:
- What are the linguistic characteristics of travel blogs?
- What language is used in introductory biology textbooks?
- How do people express disagreement?
If you decide to build your own corpus, you've come to the right place! The rest of this wiki describes the steps you'll want to take to build a corpus that meets your needs.
Reppen, R. (2010). Building a corpus: What are the key considerations? In A. O’Keeffe & M. McCarthy (Eds.), The Routledge handbook of corpus linguistics (pp. 31-37). Routledge.
Egbert, J., Larsson, T., & Biber, D. (2020). Doing linguistics with a corpus: Methodological considerations for the everyday user. Cambridge University Press.
A video version of this content is available on the Crow YouTube channel.
Video: Best practices for corpus building
CIABATTA: Corpus in a Box: Automated Tools, Tutorials, & Advising