9. Deidentifying your data

Deidentifying your data is usually the last step in preparing your files for a corpus that you want to share with others. We have developed a two-step deidentification process. In the first step, outlined in 9a. Automatic deidentification, we run a Python script that removes proper names and other identifying information outside the body of the students’ texts. Proper names also occur in the texts themselves (especially in certain assignments, such as reflections), and these usually need to be deidentified manually. For the second step, outlined in 9b. Manual deidentification, we have developed a tool that supports manual deidentification by highlighting capitalized words. If you are not comfortable running the Python script, you can use the manual deidentification tool on its own.
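To give a rough idea of what "highlighting capitalized words" can look like, here is a minimal Python sketch. It is not the CIABATTA tool itself: the function name `highlight_capitalized`, the `[[ ]]` markers, and the regular expression are all assumptions made for illustration. The actual tool is described in 9b. Manual deidentification.

```python
import re

# Assumed pattern (not taken from the CIABATTA tool): a word that starts
# with an uppercase ASCII letter, optionally followed by letters,
# apostrophes, or hyphens.
CAPITALIZED = re.compile(r"\b[A-Z][A-Za-z'-]*")


def highlight_capitalized(text, left="[[", right="]]"):
    """Wrap every capitalized word in markers so a reviewer can spot it."""
    return CAPITALIZED.sub(lambda m: f"{left}{m.group(0)}{right}", text)


if __name__ == "__main__":
    sample = "My instructor, Dr. Smith, met me in Tucson last week."
    print(highlight_capitalized(sample))
```

A simple pattern like this also flags sentence-initial words such as "My", so it only points a human reviewer to candidates. Deciding which highlighted words are actually identifying remains a manual step.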

Navigating CIABATTA

Previous: 8b. Adding headers and changing filenames script

Next: 9a. Automatic deidentification

CIABATTA: Corpus in a Box: Automated Tools, Tutorials, & Advising

See a problem in this wiki? Report an issue. Unsure how to report using GitHub? Get help reporting.
