This folder contains the pipeline instructions for preparing text corpora. They describe all the steps needed, from converting the original data to publishing the corpus in Korp or in the download service.
In addition, the file `corpus_publishing_tasklist.md` contains checklists for the tasks in the corpus publishing pipeline. They can be copied to the description of a Jira ticket for publishing (a version of) a corpus, to keep track of the progress of the publication process.
The instructions are accessible through the GitHub browser interface or in a cloned `Kielipankki-utilities` Git repository (e.g., on Puhti). You should update your own copy of the repository with `git pull` to see the latest changes. A third option is to use the GitHub desktop client.
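As a minimal sketch of the update step mentioned above, the following shell function fast-forwards an existing clone. The function name `update_repo` and the path in the usage line are hypothetical and not part of the repository; the `--ff-only` option is a cautious choice made here, not something these instructions require.

```shell
# Hypothetical helper (not part of Kielipankki-utilities):
# bring an existing clone up to date with its upstream branch.
update_repo() {
    repo_dir="$1"
    # Stop early if the given directory is not a Git working tree.
    if ! git -C "$repo_dir" rev-parse --is-inside-work-tree >/dev/null 2>&1; then
        echo "Not a Git repository: $repo_dir" >&2
        return 1
    fi
    # --ff-only refuses to create a merge commit if local history has diverged.
    git -C "$repo_dir" pull --ff-only
}
```

It could be called as `update_repo ~/Kielipankki-utilities`, assuming that is where your clone lives.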
The instructions are organized in several files, all stored in this subfolder `docs`. The order of tasks can be seen from the checklists.
The instructions are written in Markdown format. (For more information, please see https://help.github.com/en/articles/about-writing-and-formatting-on-github.) The GitHub browser interface renders Markdown and makes it easy to read and edit the text.
Please feel free to give feedback, correct and edit the instructions where needed, and add what you think is missing. You can also add new files. You might find placeholders in the instructions for information that is still missing (e.g., a guideline for testing text corpora in Korp); everybody is welcome to fill them in.