6. Converting, encoding, and standardizing your data

Shelley Staples edited this page Nov 24, 2021 · 13 revisions

Overview

If you want to use your offline corpus with any of the following:

  • Concordancing software (e.g., AntConc, Sketch Engine, WordSmith, LancsBox)
  • Programming languages (e.g., R, Python)
  • PoS taggers and dependency parsers

you typically need to complete the following steps:

  • Convert: change formatted text (e.g., MS Word documents) to plain text
  • Encode: change other encodings (e.g., Windows-1251, an encoding for Cyrillic scripts) to the widely used UTF-8 encoding
  • Standardize: transform all variations of the same character to a single character (e.g., the quotation marks “, ", `` and « are all converted to ")

Converting files from their original format to txt

If you collect data from participants (e.g., student texts), the files typically come in MS Word or PDF format. MS Word and PDF are rich text formats, meaning that the files contain information about fonts, color, etc. They may also contain images and hypertext links.

In contrast to rich text files, plain text files (e.g., .txt, .csv, HTML, XML) do not depend on a particular program to be opened and read, which makes them more machine-readable. Thus, files in rich text formats (i.e., files that contain formatting) need to be converted to plain text. The plain text format we use is .txt, because .txt files are generally smaller and, unlike HTML and XML files, do not require additional markup.
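As a rough illustration of the conversion step, the sketch below pulls the paragraph text out of a .docx file using only the Python standard library. It relies on the fact that a .docx file is a zip archive whose main text lives in word/document.xml. The function names docx_to_text and convert_docx_file are made up for this example, and the sketch ignores headers, footnotes, and table layout, so treat it as a starting point rather than a complete converter.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(docx_path):
    """Return the body text of a .docx file, one paragraph per line."""
    with zipfile.ZipFile(docx_path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph is split into runs; each run's text sits in a w:t element
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def convert_docx_file(docx_path, txt_path):
    """Write the text of docx_path to txt_path as UTF-8 plain text."""
    with open(txt_path, "w", encoding="utf-8") as out:
        out.write(docx_to_text(docx_path))
```

Calling convert_docx_file("essay.docx", "essay.txt") (hypothetical file names) would write a plain text copy next to the original. PDFs need a different tool, since their internal structure is not XML in a zip archive.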

Encoding to UTF-8

The characters in text files can be encoded in a variety of ways. To ensure that the right characters are displayed on all computers, all your files need to be encoded in UTF-8, an encoding of Unicode, which is a universal character set. UTF-8 is the most widely used character encoding.

You can run a command to check the encoding of your files:

On a Mac: file -I folder_name/*.txt (the flag is a capital I; on Linux, the equivalent is the lowercase file -i folder_name/*.txt)
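If you prefer to check and convert files from a script, the following minimal sketch shows one way to do it in Python. The helper names is_utf8 and convert_to_utf8 are invented for this example, and the sketch assumes you already know the original encoding of the file (e.g., "cp1251" for Windows-1251), because the encoding cannot be reliably guessed from the bytes alone.

```python
from pathlib import Path


def is_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_to_utf8(path, source_encoding):
    """Re-encode a file in place from source_encoding to UTF-8.

    source_encoding is an assumption supplied by the caller,
    e.g. "cp1251" for Windows-1251.
    """
    raw = Path(path).read_bytes()
    text = raw.decode(source_encoding)
    # Write bytes directly so line endings are left untouched
    Path(path).write_bytes(text.encode("utf-8"))
```

Note that a file containing only ASCII characters also passes the is_utf8 check, since ASCII is a subset of UTF-8. It is a good idea to keep a backup of the originals before converting in place.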

Standardizing non-ASCII characters and removing non-English characters

This step standardizes the characters even further, to ASCII, by removing all non-English characters. ASCII (pronounced ASS-kee), an abbreviation of American Standard Code for Information Interchange, is a character encoding standard designed for the English language.

If your corpus is in a language other than English, you should not use this step. Even if the language of your corpus uses the Roman alphabet, this step will flatten any additional characters, such as letters with diacritics (e.g., áàãäâåāăąǎȃȧ), to their base letter (e.g., a).
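To make the flattening concrete, here is a small sketch using Python's standard unicodedata module. It is one possible way to reduce text to ASCII, not necessarily the method used by any particular tool. The function names strip_diacritics and to_ascii are hypothetical.

```python
import unicodedata


def strip_diacritics(text):
    """Replace accented letters with their base letter (e.g., á becomes a)."""
    # NFKD splits an accented letter into base letter + combining mark
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_ascii(text):
    """Flatten diacritics, then drop every remaining non-ASCII character."""
    return strip_diacritics(text).encode("ascii", "ignore").decode("ascii")


print(to_ascii("café naïve"))  # prints: cafe naive
```

Characters with no ASCII base letter, such as Cyrillic letters, are dropped entirely by to_ascii, which is why this step is unsuitable for non-English corpora.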

In addition to removing non-English characters, at this step you also want to standardize non-ASCII characters. For example, word processors often insert their own special versions of a few standard characters, such as ' and ". These characters, referred to as smart quotes (e.g., \u2018 and \u2019), should be replaced with the ASCII versions of these characters.
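A replacement of this kind can be sketched with a simple character mapping. The table below covers only the quotation mark variants mentioned on this page plus their closing counterparts. The names QUOTE_MAP and standardize_quotes are illustrative, and a real corpus may need more entries (e.g., dashes or ellipses).

```python
# Map typographic quotation marks to their ASCII equivalents
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (also used as apostrophe)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u00ab": '"',   # left-pointing guillemet
    "\u00bb": '"',   # right-pointing guillemet
})


def standardize_quotes(text):
    """Convert smart quotes, guillemets, and `` to plain ASCII quotes."""
    # Two backticks are sometimes used as an opening double quote
    text = text.replace("``", '"')
    return text.translate(QUOTE_MAP)
```

Running this before the ASCII flattening step matters: if non-ASCII characters are simply dropped first, the smart quotes disappear instead of being replaced.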

Navigating CIABATTA

Previous: 5. Organizing your data

Next: 6a. Automatic processing with our Corpus Text Processor