9. Deidentifying your data
Deidentifying your data is usually the last step in preparing your files for a corpus that you want to share with others. We have developed a two-step deidentification process. In the first step (automatic deidentification), we run a Python script that removes proper names and other identifying information outside the body of the students’ texts. Many proper names also occur in the texts themselves (especially in certain assignments, such as reflections), and these usually need to be deidentified manually. For the second step (manual deidentification), we have developed a tool that helps by highlighting capitalized words. If you are not comfortable running the Python script, you can use the manual deidentification tool on its own.
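To illustrate the idea behind highlighting capitalized words, here is a minimal sketch in Python. It is not the project's manual deidentification tool: the function name `flag_capitalized`, the `**` marker, and the sample sentence are all assumptions made for illustration. The pattern only covers unaccented ASCII letters, and it also flags ordinary sentence-initial words, so every flagged word still needs a human decision.

```python
import re

# Hypothetical helper for illustration only; not part of the CIABATTA tools.
# Matches a word that starts with an uppercase ASCII letter followed by
# one or more lowercase ASCII letters.
CAPITALIZED = re.compile(r"\b[A-Z][a-z]+\b")


def flag_capitalized(text, marker="**"):
    """Wrap each capitalized word in `marker` so a reviewer can spot it."""
    return CAPITALIZED.sub(lambda m: f"{marker}{m.group(0)}{marker}", text)


# Invented sample sentence (assumption), not taken from any real student text.
sample = "My tutor Maria helped me revise the essay in Tucson."
print(flag_capitalized(sample))
# prints: **My** tutor **Maria** helped me revise the essay in **Tucson**.
```

Note that "My" is flagged along with the real proper names. A sketch like this can only surface candidates; deciding which ones identify a person remains a manual step.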
CIABATTA: Corpus in a Box: Automated Tools, Tutorials, & Advising