9. Deidentifying your data

Deidentifying your data is usually the last step in preparing your files for a corpus that you want to share with others. We have developed a two-step deidentification process. In the first step (automatic deidentification), we run a Python script that removes proper names and other identifying information outside the body of the students’ texts. Many proper names also occur in the texts themselves (especially in certain assignments, such as reflections), and these usually need to be deidentified manually. For the second step (manual deidentification), we have developed a tool that supports this work by highlighting capitalized words. If you are not comfortable running the Python script, you can use the manual deidentification tool on its own.
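To give a rough sense of what "highlighting capitalized words" means in the manual step, here is a minimal Python sketch. It is not the CIABATTA tool itself: the function name `flag_capitalized`, the `[[...]]` markers, and the sample sentence are all assumptions made for illustration.

```python
import re

# Assumed rule for this sketch: a "capitalized word" is any word
# that starts with an ASCII uppercase letter.
CAPITALIZED = re.compile(r"\b[A-Z]\w*")


def flag_capitalized(text: str) -> str:
    """Wrap each capitalized word in [[...]] so a reviewer can spot it."""
    return CAPITALIZED.sub(lambda m: f"[[{m.group(0)}]]", text)


# Hypothetical sample sentence, not taken from any real corpus.
sample = "my tutor Maria helped me in Tucson."
print(flag_capitalized(sample))
# prints: my tutor [[Maria]] helped me in [[Tucson]].
```

A rule this simple also flags sentence-initial words and misses names written in lowercase, which is one reason a person still needs to review each flagged word before replacing it.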

9a. Automatic deidentification

9b. Manual deidentification

CIABATTA: Corpus in a Box: Automated Tools, Tutorials, & Advising
