9b. Manual deidentification
- Getting started
- Installation
- Preparing files to be deidentified
- Deidentifying a file
- Saving the file
- Known limitations
- Video presentation
The deidentification tool is an application for Windows and Mac, available both as a download and on the web, that provides an interface for removing identifying information, such as names and titles, from texts. The interface lets users replace identifying information with a tag such as <name> or <place>. This tool is usually used after the automatic deidentification script has been run. The manual tool is needed to make sure that names and other identifying information have actually been removed, if this is a concern (e.g., you are planning to share data).
After deidentifying a file, users can then save the file for use in their corpus.
There are two ways to use the deidentification tool: online or offline.
For online use, simply visit this link.
For offline use, download the zipped file from CIABATTA > manual_deidentification > interactive_deidentifying_tool.zip
Before running the deidentification tool, make sure that your texts are in .txt format, UTF-8 encoded, and cleaned of non-ASCII characters. You can use our Corpus Text Processor to perform these steps. We also recommend that you create an empty folder structure that duplicates the folder structure where you have stored your files.
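The preparation checks above can also be scripted. The following is a minimal sketch, not part of CIABATTA or the Corpus Text Processor: the function names `check_text_file` and `mirror_folders` are hypothetical, and the sketch assumes Python 3.9 or later and that the target folder lives outside the source folder.

```python
from pathlib import Path


def check_text_file(path: Path) -> list[str]:
    """Return a list of problems found in one file (an empty list means OK)."""
    problems = []
    if path.suffix.lower() != ".txt":
        problems.append("not a .txt file")
    raw = path.read_bytes()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # The bytes cannot be read as UTF-8, so no further checks are possible.
        problems.append("not valid UTF-8")
        return problems
    if not text.isascii():
        problems.append("contains non-ASCII characters")
    return problems


def mirror_folders(source_root: Path, target_root: Path) -> None:
    """Recreate the subfolder layout of source_root under target_root.

    Only folders are created; no files are copied.
    """
    target_root.mkdir(parents=True, exist_ok=True)
    for folder in source_root.rglob("*"):
        if folder.is_dir():
            new_folder = target_root / folder.relative_to(source_root)
            new_folder.mkdir(parents=True, exist_ok=True)
```

Calling `check_text_file` on each file in your corpus shows which files still need cleaning, and `mirror_folders` gives you an empty copy of the folder layout in which to save the deidentified files.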
To use the tool, choose the file that you want to deidentify. The tool will highlight all words that are potential names (i.e., all capitalized words). Click on the word you wish to deidentify and choose the appropriate tag (e.g., <name>) from the menu on the right. Continue this process until you have deidentified the entire file. If you see a name that is not capitalized, you can click on that word and add the appropriate tag as well.
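To illustrate why every highlighted word still needs a human decision, here is a rough sketch of the "all capitalized words" rule. It is an approximation written for this page, not the tool's actual code: the regular expression and the helper names `find_candidates` and `replace_span` are assumptions.

```python
import re

# Assumption: a "potential name" is any run of ASCII letters that starts with
# an uppercase letter. The actual tool may use a different rule.
CAPITALIZED_WORD = re.compile(r"\b[A-Z][A-Za-z]*\b")


def find_candidates(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, word) for every capitalized word in text."""
    return [(m.start(), m.end(), m.group()) for m in CAPITALIZED_WORD.finditer(text)]


def replace_span(text: str, start: int, end: int, tag: str) -> str:
    """Replace the characters from start up to (not including) end with tag."""
    return text[:start] + tag + text[end:]


sentence = "Yesterday I talked to John in Tucson."
candidates = find_candidates(sentence)
# Candidates include sentence-initial words and "I", not only real names,
# so a person has to pick which ones to tag.
start, end, word = candidates[2]  # the match for "John"
tagged = replace_span(sentence, start, end, "<name>")
```

In this sketch, "Yesterday" and "I" are flagged alongside "John" and "Tucson", which mirrors the situation in the tool: highlighting only suggests candidates, and you choose which ones receive a tag.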
Once you have deidentified the file, click on "Download file" and navigate to where you would like to save the file. Note that the suffix "_deidentified" has been added to your original filename.
In non-tokenized files, the replacement tag (e.g., <name>) also replaces any punctuation attached to the word. For example, in "I talked to John.", the token "John." is replaced by <name> because "John." is treated as one token.
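The limitation can be reproduced with a small sketch that splits text on whitespace. This assumes whitespace-delimited tokens, which matches the behavior described above but is not taken from the tool's source; `replace_token` is a hypothetical helper.

```python
def replace_token(text: str, index: int, tag: str) -> str:
    """Replace the whitespace-delimited token at position index with tag."""
    tokens = text.split()
    tokens[index] = tag
    # Joining with single spaces normalizes the original whitespace.
    return " ".join(tokens)


# Non-tokenized text: the period is part of the token "John." and is lost.
untokenized = replace_token("I talked to John.", 3, "<name>")

# Tokenized text: the period is its own token and survives.
tokenized = replace_token("I talked to John .", 3, "<name>")
```

Here `untokenized` ends with the tag and no period, while `tokenized` keeps the final period. If sentence-final punctuation matters for your analysis, check tagged words at the ends of sentences.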
A video version of this content is available on the Crow YouTube channel.
Video: Manual Deidentification
Previous: 9a. Automatic deidentification
CIABATTA: Corpus in a Box: Automated Tools, Tutorials, & Advising