
9b. Manual deidentification

Shelley Staples edited this page Jul 8, 2022 · 9 revisions

Getting started

The deidentification tool is an application, available both online and as a download for Windows and Mac, that provides an interface for removing identifying information, such as names and titles, from texts. The interface allows users to replace identifying information with a tag such as <name> or <place>. The tool is usually used after the automatic deidentification script has been run. This manual pass is necessary for ensuring that names and other identifying information are removed when that is a concern (e.g., you are planning to share data).

After deidentifying a file, users can then save the file for use in their corpus.

Installation

There are two ways to use the deidentification tool: online or offline.

For online use, simply visit this link.

For offline use, download the zipped file from CIABATTA > manual_deidentification > interactive_deidentifying_tool.zip

Preparing files to be deidentified

Before running the deidentification tool, make sure that your texts are in .txt format, UTF-8 encoded, and cleaned of non-ASCII characters. You can use our Corpus Text Processor to perform these steps. We also recommend that you create an empty folder structure that duplicates the folder structure where you have stored your files.
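If you would like to double-check your files before opening them in the tool, the checks above can be sketched in a few lines of Python. This is only an illustrative sketch and is not part of CIABATTA or the Corpus Text Processor; the function names check_text_file and mirror_folders are hypothetical.

```python
from pathlib import Path


def check_text_file(path: Path) -> list:
    """Return a list of problems found in one file (an empty list means it looks ready)."""
    problems = []
    if path.suffix.lower() != ".txt":
        problems.append("not a .txt file")
    try:
        text = path.read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        # The remaining check needs decoded text, so stop here.
        problems.append("not valid UTF-8")
        return problems
    if not text.isascii():
        problems.append("contains non-ASCII characters")
    return problems


def mirror_folders(source_root: Path, target_root: Path) -> None:
    """Recreate the subfolder layout of source_root under target_root, without copying files."""
    target_root.mkdir(parents=True, exist_ok=True)
    for folder in source_root.rglob("*"):
        if folder.is_dir():
            (target_root / folder.relative_to(source_root)).mkdir(parents=True, exist_ok=True)
```

Keep the target folder outside the source folder so that the empty copy is not itself mirrored. The empty folder structure then gives you a matching place to save each deidentified file.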

Deidentifying a file

To use the tool, choose the file that you want to deidentify. The tool will highlight all words that are potential names (i.e., all capitalized words). Click on the word you wish to deidentify and choose the appropriate tag (e.g., <name>) from the menu on the right. Continue this process until you have deidentified the entire file. If you see a name that is not capitalized, you can click on that word and add the appropriate tag as well.
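To get a feel for which words will be highlighted, the capitalization rule can be approximated as follows. This is a rough sketch under the assumption that a "potential name" is any word beginning with an uppercase ASCII letter; the tool's actual matching logic is not documented here and may differ, and candidate_names is a hypothetical helper.

```python
import re

# Assumption: a candidate is a run of letters that starts with an uppercase letter.
CAPITALIZED = re.compile(r"\b[A-Z][A-Za-z]*\b")


def candidate_names(text: str) -> list:
    """Return (character offset, word) pairs for every capitalized word in text."""
    return [(match.start(), match.group()) for match in CAPITALIZED.finditer(text)]


for offset, word in candidate_names("I talked to John in Tucson."):
    print(offset, word)
```

Sentence-initial words and the pronoun "I" are flagged too, which is why each highlighted word still needs a human decision, and why lowercase names have to be caught by reading the text.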

Saving the file

Once you have deidentified the file, click on "Download file" and navigate to where you would like to save the file. Note that the downloaded file's name is your original filename with "_deidentified" added.

Known limitations

In non-tokenized files, the tag (e.g., <name>) also replaces any punctuation attached to the word. For example, in "I talked to John.", the string "John." is replaced by <name> because "John." is treated as a single token.
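The effect can be reproduced with a small sketch, assuming tokens are split on whitespace (an assumption made for illustration; the tool's internals may differ). The helpers replace_token and separate_punctuation are hypothetical.

```python
import re


def replace_token(text: str, index: int, tag: str) -> str:
    """Replace the whitespace-delimited token at position index with tag."""
    tokens = text.split()
    tokens[index] = tag
    return " ".join(tokens)


def separate_punctuation(text: str) -> str:
    """Put a space before punctuation that directly follows a word character."""
    return re.sub(r"(?<=\w)([.,;:!?])", r" \1", text)


sentence = "I talked to John."
# Non-tokenized: "John." is one token, so the period is lost.
print(replace_token(sentence, 3, "<name>"))
# Tokenized first: the period is its own token and survives.
print(replace_token(separate_punctuation(sentence), 3, "<name>"))
```

If the lost punctuation matters for your analysis, separating punctuation from words before deidentifying is one way to keep it.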

Video presentation

A video version of this content is available on the Crow YouTube channel.

Video: Manual Deidentification


Navigating CIABATTA

Previous: 9a. Automatic deidentification

Next: Return to CIABATTA wiki front page