8b. Adding headers and changing filenames script

Shelley Staples edited this page Nov 30, 2021 · 8 revisions

Required installations

Before you try to run this script, make sure you have the following installed:

| If you have Anaconda installed | If you have installed Python another way |
| --- | --- |
| conda install pandas | pip install pandas |
| conda install xlrd | pip install xlrd |
| conda install pyyaml | pip install pyyaml |
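To confirm that the three packages are visible to the Python interpreter you will use, a small check like the one below may help. It is only a sketch: the function name find_missing is made up for this example, and the only outside fact it relies on is that the pyyaml package is imported under the name yaml.

```python
import importlib.util

# Install name -> import name. Note that pyyaml is imported as "yaml".
REQUIRED_MODULES = {"pandas": "pandas", "xlrd": "xlrd", "pyyaml": "yaml"}


def find_missing(required=REQUIRED_MODULES):
    """Return the install names of packages whose modules cannot be found."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

If anything is reported as missing, install it with the matching conda or pip command above and run the check again.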

Downloading the script and the files

We have included a folder with some sample files for you to run the script on. Both the folder and the script are inside ciabatta > metadata. The folder is called standardized, and the Python script is called ciabatta_headers.py. Inside the standardized folder there are three course subfolders: 101, 102 and 10600. When you run the script, you need to specify a course subfolder.

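Inside the ciabatta folder, the metadata folder holds the script ciabatta_headers.py, the standardized folder with its three course subfolders, and metadata_folder, which contains the metadata spreadsheet master_student_data.xlsx used in the commands on this page.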

You can download the script and the sample files in either of two ways (a or b below).

a) From the GitHub website: Navigate to the ciabatta repository, then click the "Code" button in the upper right corner and select "Download ZIP". This downloads a zip file to your computer. Unzip it (Windows users: make sure you actually extract the files), and you will have the script and the sample folder on your computer.

b) From the terminal: Navigate to the ciabatta repository on GitHub, click the "Code" button in the upper right corner, and copy the link. Then open the Terminal on a Mac (on Windows, use Command Prompt or PowerShell) and run this line:

git clone https://github.com/writecrow/ciabatta.git

This downloads the repository, including the script and the sample files, to your computer.
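Whichever way you download the files, you may want to confirm that the pieces described above ended up where the commands on this page expect them. The sketch below assumes the download produced a folder named ciabatta in the current directory (a zip download may use a different folder name); missing_paths is a helper name invented for this example.

```python
from pathlib import Path

# Paths described in this section, relative to the ciabatta folder.
EXPECTED = [
    "metadata/ciabatta_headers.py",
    "metadata/standardized/101",
    "metadata/standardized/102",
    "metadata/standardized/10600",
]


def missing_paths(ciabatta_root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under the root."""
    root = Path(ciabatta_root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    # "ciabatta" is an assumed folder name; adjust it to your download location.
    missing = missing_paths("ciabatta")
    print("Missing:", missing if missing else "nothing, the download looks complete")
```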

Running the script on Mac

Adding metadata to your files should come after your corpus files have been converted to .txt, encoded in UTF-8, and standardized to ASCII characters (only for English). To do this, use the Corpus Text Processor.
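If you are unsure whether a folder has already been standardized, a quick check along these lines can flag files that still contain non-ASCII bytes. This is a sketch, not part of ciabatta_headers.py: non_ascii_files is a name chosen for this example, and the check only applies to English corpora, as noted above. Any file that is pure ASCII is also valid UTF-8.

```python
from pathlib import Path


def non_ascii_files(directory):
    """Return .txt files under directory that contain bytes outside ASCII."""
    flagged = []
    for path in sorted(Path(directory).rglob("*.txt")):
        # bytes.isascii() is available in Python 3.7 and later.
        if not path.read_bytes().isascii():
            flagged.append(path)
    return flagged


if __name__ == "__main__":
    for path in non_ascii_files("standardized/101"):
        print("Still contains non-ASCII characters:", path)
```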

Before you run the script, count the files in the folder standardized/101 so that you can confirm the same number of files exists after you run the script. The path is relative to the metadata folder, so run the following command from inside that folder:

ls standardized/101/**/**/*.txt | wc -l
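The same count can be taken in Python, which works the same way on Mac and Windows. This is a minimal sketch with one difference from the shell command: it counts .txt files at any depth below the folder, whereas the ls pattern above matches a specific folder depth. The function name count_txt_files is an assumption for this example.

```python
from pathlib import Path


def count_txt_files(directory):
    """Count .txt files at any depth below directory."""
    return sum(1 for path in Path(directory).rglob("*.txt") if path.is_file())


if __name__ == "__main__":
    # Run from inside the metadata folder so the relative path resolves.
    print(count_txt_files("standardized/101"))
```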

After you’ve downloaded the ciabatta folder (see Downloading the script and the files), navigate to the metadata subfolder with this command, if you are not already there:

cd metadata

When you’re inside the metadata subfolder, you’re ready to run the script. As a reminder, to run the script, you will need two components: a folder with your corpus in .txt files and a spreadsheet with metadata. Here is the command to run the script:

python ciabatta_headers.py --directory=standardized/101 --master_file=metadata_folder/master_student_data.xlsx
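If you prefer to launch the script from Python, for example to process several course subfolders in a row, the command can be assembled as below. This is a sketch under stated assumptions: build_command and run_headers_script are names invented here, and only the two flags shown in the command above (--directory and --master_file) are used. Building the paths with pathlib lets the same code produce the right separators on Mac and Windows.

```python
import subprocess
import sys
from pathlib import Path


def build_command(directory, master_file, script="ciabatta_headers.py"):
    """Assemble the argument list for the command shown above."""
    return [
        sys.executable,  # the Python interpreter that is running this code
        script,
        f"--directory={Path(directory)}",
        f"--master_file={Path(master_file)}",
    ]


def run_headers_script(directory, master_file):
    """Run the script; call this from inside the metadata folder."""
    return subprocess.run(build_command(directory, master_file), check=True)


if __name__ == "__main__":
    command = build_command(
        Path("standardized") / "101",
        Path("metadata_folder") / "master_student_data.xlsx",
    )
    # Only print the command here; call run_headers_script(...) to execute it.
    print(" ".join(command))
```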

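Here, directory is the folder where you saved your files, and master_file is the path to your metadata spreadsheet. Your metadata folder should now contain a new folder called files_with_headers. Check that it contains the same number of files you counted before running the script. To count the files, run the following command: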

ls files_with_headers/**/**/**/**/*.txt | wc -l

Video presentation for Mac

A video version of this content is available on the Crow YouTube channel.

Video: Running the script on Mac

Running the script on Windows

Adding metadata to your files should come after your corpus files have been converted to .txt, encoded in UTF-8, and standardized to ASCII characters (only for English). To do this, use the Corpus Text Processor.

After you’ve downloaded the ciabatta folder (See Section … on how to do that), navigate to the metadata subfolder with this command:

cd metadata

When you’re inside the metadata subfolder, you’re ready to run the script. Before you run it, count the files in the folder standardized/101 so that you can confirm the same number of files exists after you run the script. To count the files, run the following command in PowerShell (Measure-Object is a PowerShell command and is not available in Command Prompt):

ls standardized/101/**/**/*.txt | Measure-Object -Line

As a reminder, to run the script, you will need two components: a folder with your corpus in .txt files and a spreadsheet with metadata. Here is the command to run the script:

python ciabatta_headers.py --directory=standardized\101 --master_file=metadata_folder\master_student_data.xlsx

where directory is the folder where you saved your files, and master_file is the path to your metadata spreadsheet. Your metadata folder should now contain a new folder called files_with_headers. Run the ls command to confirm that it is there, then open the folder and check that the files have new filenames and headers. Finally, check that files_with_headers contains the same number of files you counted before running the script. To count the files, run the following command:

ls files_with_headers/**/**/**/**/*.txt | Measure-Object -Line
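The two checks described above, comparing the file counts and looking at the new headers, can also be done with a short Python script that runs the same way on Windows and Mac. The sketch below makes no assumption about what the headers contain; it only prints the first few lines of some output files so you can inspect them. The function names are invented for this example, and the comparison is only meaningful if files_with_headers holds output from the one course folder you are comparing against.

```python
from itertools import islice
from pathlib import Path


def txt_files(directory):
    """Return all .txt files at any depth below directory, sorted by path."""
    return sorted(p for p in Path(directory).rglob("*.txt") if p.is_file())


def compare(before_dir, after_dir):
    """Return (input count, output count, whether the two counts match)."""
    n_before, n_after = len(txt_files(before_dir)), len(txt_files(after_dir))
    return n_before, n_after, n_before == n_after


def preview(path, lines=5):
    """Return the first few lines of a file for visual inspection."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, lines)]


if __name__ == "__main__":
    # Run from inside the metadata folder.
    n_before, n_after, same = compare("standardized/101", "files_with_headers")
    print(f"{n_before} input files, {n_after} output files, match: {same}")
    for path in txt_files("files_with_headers")[:3]:
        print(path)
        print("\n".join(preview(path)))
```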

Video presentation for PC

A video version of this content is available on the Crow YouTube channel.

Video: Running the script on PC

Navigating CIABATTA

Previous: 8a. Why add headers and filenames?

Next: 9. Deidentifying your data