9. Deidentifying your data

Shelley Staples edited this page Nov 16, 2021 · 7 revisions

Deidentifying your data

Deidentifying your data is usually the last step in preparing your files for a corpus that you want to share with others. We have developed a two-step deidentification process. In the first step (automatic deidentification), we run a Python script that removes proper names and other identifying information outside the body of the students’ texts. Proper names also occur within the texts themselves, especially in certain assignments such as reflections, and these usually need to be deidentified manually. For the second step (manual deidentification), we have developed a tool that supports this work by highlighting capitalized words. If you are not comfortable running the Python script, you can use the manual deidentification tool on its own.
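To give a rough sense of what highlighting capitalized words for manual review can look like, here is a minimal sketch. It is not the project's actual tool: the function name `highlight_capitalized`, the `[[ ]]` markers, and the regular expression are all assumptions made for illustration, and the pattern only covers unaccented ASCII letters.

```python
import re

# Hypothetical pattern (an assumption, not the project's rule): one uppercase
# ASCII letter followed by one or more lowercase ASCII letters.
CAPITALIZED = re.compile(r"\b[A-Z][a-z]+\b")


def highlight_capitalized(text, open_mark="[[", close_mark="]]"):
    """Wrap each capitalized word in markers so a reviewer can spot it."""
    return CAPITALIZED.sub(
        lambda match: f"{open_mark}{match.group(0)}{close_mark}", text
    )


sample = "My tutor Maria helped me revise the draft in Tucson."
print(highlight_capitalized(sample))
# prints: [[My]] tutor [[Maria]] helped me revise the draft in [[Tucson]].
```

Sentence-initial words such as "My" are flagged as well, so a person still has to decide which highlighted words are actually identifying.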

9a. Automatic deidentification

9b. Manual deidentification