Home
CIABATTA stands for "Corpus In A Box: Automated Tools, Tutorials, & Advising." It distills the accumulated knowledge of the Crow team, the corpus researchers and software developers who maintain the Corpus & Repository of Writing.
CIABATTA provides templates and code for corpus building (examples, design patterns, best practices, and step-by-step processes) that serve as a starting point for developing new corpora. The guides and guidelines included here can be used as-is or extended to fit the particular needs of a given corpus.
We recommend against using Safari with the wiki, the tools, or GitHub more generally.
1. Best practices for corpus building
3. Ethical issues in corpus building
4. Checking consents and collecting data
6. Converting, encoding, and standardizing your data
6a. Automatic processing with our Corpus Text Processor
6b. Manually converting your data
7. Organizing, preparing and processing metadata
7a. Gathering and preparing metadata
7b. Running the metadata processing script
8. Adding headers and changing filenames
8a. Why add headers and filenames?
8b. Adding headers and changing filenames script
9a. Automatic deidentification