6b. Manually converting your data

Shelley Staples edited this page Nov 30, 2021 · 7 revisions

Overview

If your data contains links to websites, images, or other multimedia that you want to retain references to, you may need to convert it to .txt format manually. Our Corpus Text Processor and similar programs convert files to .txt automatically, but they do not retain references to images or other multimedia. Converting manually allows you to annotate this multimedia in the text (e.g., replacing an image with the tag <picture>). In addition, some files store their text as images rather than as machine-readable text (for example, scans that have not been run through OCR), so automatic conversion will not capture the text of those files either.

Manually converting

Open and set up converting applications/tools

  • On a PC, open Notepad++ and create a blank .txt file. We recommend Notepad++ rather than Notepad, the default Windows program, because Notepad uses a Windows proprietary encoding as its default.
  • On a Mac, open a blank TextEdit document. When you open the application, go to “Preferences” under the “TextEdit” menu and make sure your preferences match the screenshot below (be sure to uncheck smart quotes and smart dashes). Unchecking these features helps ensure that quotation marks and dashes in your text file are saved as standard ASCII characters.
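
If smart quotes or dashes have already made their way into a text file, a short script can convert them back and report anything else that is not ASCII. The following is a minimal sketch in Python, not part of the Crow toolchain; the function names and the choice of replacement characters are our own assumptions for illustration.

```python
# Map common "smart" punctuation to plain ASCII equivalents.
# The replacements chosen here are an assumption; adjust them to your project.
SMART_TO_ASCII = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "--",  # em dash
}


def normalize_punctuation(text):
    """Replace smart quotes and dashes with ASCII equivalents."""
    return text.translate(str.maketrans(SMART_TO_ASCII))


def remaining_non_ascii(text):
    """Return the set of characters that are still outside the ASCII range."""
    return {ch for ch in text if ord(ch) > 127}


if __name__ == "__main__":
    sample = "\u201cHello\u201d \u2013 it\u2019s a test"
    cleaned = normalize_punctuation(sample)
    print(cleaned)
    print(remaining_non_ascii(cleaned))
```

Any characters reported by remaining_non_ascii are worth checking by hand, since they may be legitimate content rather than formatting debris.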

Add text to applications/tools

Once you have opened Notepad++ or TextEdit, the next step is to copy the contents of the original file and paste them into the application. You can copy all text, including headings, from the original files. It is easier to do this paragraph by paragraph than to select the entire text at once. Copy the text in reading order: top to bottom, left to right. After copying, check your text files to confirm that everything was copied and pasted correctly. As mentioned above, some file formats store text as an image rather than in a machine-readable format. In this case, you can either create a version of the file that allows for optical character recognition (OCR), or you can retype the text in your open Notepad++ or TextEdit file.

Annotate multimedia

Once you have the text contents in .txt format, you can add manual annotations for multimedia. Common types of multimedia that you may want to annotate include pictures, videos, and hyperlinks. For example, you can insert an annotation in angle brackets (e.g., <picture>) where a picture appears in the text. It is important to enclose annotations in angle brackets <> or square brackets [], and to use one style consistently, so that you can exclude them from other computational processes, such as searches within a concordancing program or taggers/parsers. Make sure to leave a space on either side of each annotation. Similarly, if your targets are videos, links, tables, etc., you can type <video from external source>, <link to external source>, <table from external source>, etc. where these elements occur within the text itself.
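
To illustrate why consistent brackets matter, here is a minimal Python sketch of how annotations could be listed or removed before a text is passed to another tool. It assumes angle brackets are the chosen style and that the text contains no other literal < or > characters; the function names are hypothetical and not part of any Crow tool.

```python
import re

# Matches one angle-bracket annotation, such as <picture> or
# <link to external source>. Assumes < and > are used only for annotations.
ANNOTATION = re.compile(r"<[^<>]+>")


def list_annotations(text):
    """Return every annotation in the order it appears."""
    return ANNOTATION.findall(text)


def strip_annotations(text):
    """Remove annotations and collapse the extra spaces left behind."""
    without_tags = ANNOTATION.sub(" ", text)
    return re.sub(r"[ \t]+", " ", without_tags).strip()


if __name__ == "__main__":
    sample = "See the chart <picture> and the clip <video from external source> here."
    print(list_annotations(sample))
    print(strip_annotations(sample))
```

If you use square brackets instead, the pattern would need to change accordingly (for example, to match \[ and \] rather than < and >).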

Your annotations can be as generic or specific as you wish them to be. For example, in developing the Crow corpus, we often encounter portfolios that contain links to specific major projects that students have written. We can indicate which major project the student references in the text by using “<link to literacy narrative>” where the student writer has linked it to their reflective text.

If the original files contain non-Roman characters, these will appear as (??) in .txt files when converted automatically, since they are non-ASCII characters. You can replace them with a tag naming the language, e.g., <Chinese> or <Arabic>. If the entire text has been translated into a language other than the primary language of your corpus, you may want to indicate this as well, e.g., <Chinese translation of entire text>.
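
To find the places that may need a language annotation, you could scan a converted file for question-mark placeholders and for any remaining non-ASCII characters. The Python sketch below is one way to do this under our own assumptions: the pattern treats any run of two or more question marks as a possible placeholder, and the function name is hypothetical. It only produces a list to review by hand, since a writer may have typed "??" deliberately.

```python
import re

# A run of two or more question marks (assumed conversion placeholder)
# or any run of characters outside the ASCII range.
SUSPECT = re.compile(r"\?{2,}|[^\x00-\x7F]+")


def find_suspect_lines(text):
    """Return (line_number, line) pairs that may need a language annotation."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if SUSPECT.search(line)
    ]


if __name__ == "__main__":
    sample = "Hello\nMy name is ?? in Chinese\nA plain line"
    for number, line in find_suspect_lines(sample):
        print(number, line)
```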

Video presentation

A video version of this content is available on our YouTube channel:

Manually converting your data

Navigating CIABATTA

Previous: 6a. Corpus Text Processor

Next: 7. Organizing, preparing and processing metadata