A paramount asset of the methodology is the laboratory notebook (labbook), similar to the ones biologists, chemists and scientists from other fields use on a daily basis to document the progress of their work. This notebook is a single file inside the project repository, shared between all collaborators. The main motivation for keeping a labbook is that anyone, from the original researchers to external reviewers, can later use it to understand all the steps of the study and potentially reproduce and improve it. This single, self-contained file has two main parts. The first aims at carefully documenting the development and use of the researchers' complex source code. The second is concerned with keeping the experimentation journal.
This part serves as a starting point for newcomers, but also as a good reminder for everyday users. The labbook explains the general ideas behind the whole project and methodology, i.e., what the workflow for doing experiments is and how the code and data are organized in folders. It also states the conventions on how the labbook itself should be used. Details about the different programs and scripts, along with their purpose, follow. This information concerns the source code used in the experiments as well as the tools for manipulating data and the analysis code used for producing plots and reports. Additionally, there are a few explanations of the revision control usage and conventions. Moreover, this part of the labbook contains a few examples of how to run scripts, illustrating the most common arguments and formats. Although such information might seem redundant with the previous documentation part, in practice such examples are indispensable even for experienced users, since some scripts have many environment variables, arguments and options. It is also important to keep track of major changes to the source code and the project in general inside a ChangeLog. Since all modifications are already captured and commented in Git commits, the log section offers a much more coarse-grained view of the code development history. There is also a list with a brief description of every Git tag in the repository, as it helps in finding the latest stable or any other specific version of the code.
All experiments should be carefully noted in this part, together with the key input parameters, the motivation for running such an experiment and remarks on the results. For each experimental campaign there should be a new entry that answers the questions “why”, “when”, “where” and “how” the experiments were run, and finally what the observations on the results are. Inside these descriptive conclusions, Org-mode allows for using both plain links and git-links connecting the text to specific revisions of files. These hyperlinks point to crucial data and analysis reports that illustrate a newly discovered phenomenon.
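For instance, a plain Org link from such a conclusion to a report produced during the campaign could look as follows (the path and description are illustrative):

[[file:data/xp1/analysis_report.pdf][Analysis report for the xp1 campaign]]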
This simple example project was hugely inspired by Arnaud Legrand’s Master course on performance evaluation and experimental methodology. The motivation for and explanation of the methodology used for this project are given in more detail in this journal paper and the PhD thesis of Luka Stanisic. To see how this methodology helped in performing a much larger study concerning modeling and simulation of HPC applications running on top of dynamic task-based runtimes, go to the following webpage.
Sometimes the outputs of the experiments can be very large (hundreds of MB), and adding them to a Git repository is problematic: Git was not designed for handling large files, and pulling a repository of several GB takes a lot of time. Thus, it is advisable to use supplementary tools that help with large files inside Git, such as Git-LFS and git-annex. Additionally, Git-LFS is already integrated into GitHub and GitLab, so adopting it is straightforward.
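As a minimal sketch (assuming large result files live under the data folder; the tracking pattern is illustrative), enabling Git-LFS could look like this:

# One-time setup of the Git-LFS hooks in this repository
git lfs install
# Track large result files (the pattern is illustrative)
git lfs track "data/**/*.org"
git add .gitattributes
git commit -m "Track large result files with Git-LFS"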
- Create a new branch for doing experiments
- Make sure everything is committed
- Run the run_xp script with the desired parameters
- Add the newly obtained experiment results to Git
- Do the analysis
- Add to the labbook (this file), inside “* Experiment results”, a new section describing the conditions in which the experiment was performed and the major notes about the results
- Commit/push all files in the experimental branch
- Merge this new branch into the remote “data” branch (see the sketch just below)
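A command-level sketch of this workflow, with illustrative branch name, input size and file paths:

# 1. New branch for the campaign, started from master
git checkout -b xp4 master
# 2.-3. Verify a clean tree, then run the experiment
git status
./run_xp.sh -c -v 2000000
# 4.-6. Store the results and the labbook entry describing them
git add data/xp4/QuicksortData0.org
git commit -am "xp4: results and labbook entry"
# 7. Publish the experimental branch
git push origin xp4
# 8. Revert source-only commits (see the git_revert script below), then merge
./git_revert.sh
git checkout data && git merge xp4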
Scripts for running the experiments and collecting the data are written directly in this file. Later they should be extracted from it using Org-mode’s tangling mechanism. If you are not familiar with Org-mode, do not hesitate to write these scripts as regular bash (or Python) files separate from the laboratory notebook. In that case, ignore the first subsection below and all Org-babel related things in the following sections; inspect just the shell code inside the blocks.
To make writing, tangling the current file and making all scripts executable an easy process, I added the following shortcut to my Emacs configuration.
(global-set-key (kbd "C-c m")
                (lambda ()
                  (interactive)
                  (org-babel-tangle-file buffer-file-name)
                  (shell-command "chmod +x *.sh")))
Section properties are also used to define into which files the code will be tangled.
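As a minimal sketch, a property drawer that tangles all code blocks of a section into one script could look as follows (the heading name is illustrative):

* run_xp script
  :PROPERTIES:
  :header-args: :tangle run_xp.sh
  :END: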
- Written in Org-mode and later tangled into a shell script.
- Used for running the experiment in the right conditions; it also calls the get_info script.
- It also verifies that the source code is committed and that you are in the right branch before running the experiments.
Bash preamble.
#!/bin/bash
# Script for running parallel quicksort comparison
set -e # fail fast
Defining all variables that will be used in the script.
# Script parameters
basename="$PWD"
host="$(hostname | sed 's/[0-9]*//g' | cut -d'.' -f1)"
datafolder=""
dataname="QuicksortData"
srcfolder="$basename/src"
# DoE parameters
# Reading command line arguments
# Choosing which program to run (in case this script is a wrapper for multiple programs)
programname="./src/parallelQuicksort"
# Program options
testing=0
recompile=0
verbose=0
Writing the help output, to help users invoke the script.
help_script()
{
cat << EOF
Usage: $0 [options] [range1 [range2]]
Script for running the parallel quicksort comparison
OPTIONS:
-h Show this message
-t Enable testing
-c Recompile source code
-v Verbose output in the terminal
EOF
}
# Parsing options
while getopts "tcvh" opt; do
case $opt in
t)
testing=1
;;
c)
recompile=1
;;
v)
verbose=1
;;
h)
help_script
exit 4
;;
\?)
echo "Invalid option: -$OPTARG"
help_script
exit 3
;;
esac
done
Getting the input parameters (range of array sizes to sort) from the command line arguments.
shift $((OPTIND - 1))
RANGE1="-1"
RANGE2="-1"
if [[ $# == 1 ]]; then
RANGE1=$1
fi
if [[ $# == 2 ]]; then
RANGE1=$1
RANGE2=$2
fi
if [[ $# -gt 2 ]]; then
echo 'ERROR: more than two input parameters'
help_script
exit 2
fi
Verifying that everything is committed (if this is not a simple test) and that we are in the right branch.
# Doing real experiments, not just testing if the code is valid and can be executed
if [[ $testing == 0 ]]; then
branch=$(basename "$(git symbolic-ref HEAD)")
echo "Now you are in git branch: ${branch}"
# Checking the name of the branch is not master
if [[ "$branch" == *master* ]]; then
echo "ERROR-experiments can be done only in specific xp branch!"
echo "Use -t option for testing"
exit 2
fi
# Checking if everything is commited
if git diff-index --quiet HEAD --; then
echo "Everything is commited"
else
echo "ERROR-need to commit before doing experiment!"
git status
exit 1
fi
fi
For real experiments (not tests), the data folder will take its name from the branch. This can be configured differently, but this way is the easiest and completely automatic.
# Name of data folder where to store results if testing
if [[ $testing == 1 ]]; then
dataname="Test"
datafolder="$basename/data/testing"
else
datafolder="$basename/data/$branch"
fi
mkdir -p $datafolder
Giving the output file a unique name by appending an incremented number.
# Producing the original name for output file
bkup=0
while [ -e $datafolder/$dataname${bkup}.org ] || [ -e $datafolder/$dataname${bkup}.org~ ]; do
bkup=$((bkup + 1))
done
outputfile="$datafolder/$dataname${bkup}.org"
Calling an external script to capture metadata.
# Starting to write output file and capturing all metadata
title="Experiment for parallel quicksort"
./get_info.sh -t "$title" $outputfile
Compiling the source code and capturing the compilation output.
# Compilation
echo "* COMPILATION" >> $outputfile
if [[ $recompile == 1 ]]; then
echo "Cleaning the previously compiled code..."
make clean -C $srcfolder > /dev/null # Decided not to capture the output of clean
# For more complex projects, a ./configure step is needed here
# The options and the output of the ./configure should also be captured
echo "** COMPILATION OUTPUT" >> $outputfile
echo "Compiling..."
make -C $srcfolder >> $outputfile
# make install (when needed)
echo "** SHARED LIBRARIES DEPENDENCIES" >> $outputfile
ldd $programname >> $outputfile
else
echo "# used previous compile" >> $outputfile
fi
Generating randomized input parameters for the experiment using R.
rm -f input_values
Rscript input_generator.R $RANGE1 $RANGE2
Preparing and capturing the line with which the experiment will be executed.
# Prepare running options
running="$programname"
echo "* COMMAND LINE USED FOR RUNNING EXPERIMENT" >> $outputfile
echo $running >> $outputfile
Running the program.
echo "Executing program..."
# Writing results
rm -f stdout.out
rm -f stderr.out
time1=$(date +%s.%N)
set +e # In order to detect and print execution errors
# In fact only the following line is actually executing the experiment
# everything else is a wrapper, but it is very important that it is captured
#############################################
for NUM in $(cat input_values)
do
echo "Array size: $NUM" 1>> stdout.out
eval $running $NUM 1>> stdout.out 2>> stderr.out
done
#############################################
set -e
time2=$(date +%s.%N)
echo "* STDERR OUTPUT" >> $outputfile
cat stderr.out >> $outputfile
if [ ! -s stdout.out ]; then
echo "ERROR DURING THE EXECUTION!!!" >> stdout.out
fi
echo "* STDOUT OUTPUT" >> $outputfile
cat stdout.out >> $outputfile
if [[ $verbose == 1 ]]; then
cat stderr.out
cat stdout.out
fi
End and cleanup.
# Cleanup & End
echo "* ELAPSED TIME FOR RUNNING THE PROGRAM" >> $outputfile
echo "Elapsed: $(echo "$time2 - $time1"|bc ) seconds" >> $outputfile
rm -f stdout.out
rm -f stderr.out
rm -f input_values
cd $basename
- Written in Org-mode and later tangled into a shell script.
- Captures all the necessary metadata.
- Extend it (or change it) according to your specific needs.
Bash preamble.
#!/bin/bash
# Script to get machine information before doing the experiment
set +e # Don't fail fast, since some information may not be available
Defining all variables that will be used in the script.
# Script parameters
title="Experiment results"
host="$(hostname | sed 's/[0-9]*//g' | cut -d'.' -f1)"
Writing the help output, to help users invoke the script.
help_script()
{
cat << EOF
Usage: $0 [options] outputfile.org
Script to get machine information before doing the experiment
OPTIONS:
-h Show this message
-t Title of the output file
EOF
}
# Parsing options
while getopts "t:h" opt; do
case $opt in
t)
title="$OPTARG"
;;
h)
help_script
exit 4
;;
\?)
echo "Invalid option: -$OPTARG"
help_script
exit 3
;;
esac
done
Getting output file name from the command line argument.
shift $((OPTIND - 1))
outputfile=$1
if [[ $# != 1 ]]; then
echo 'ERROR: exactly one output file must be given'
help_script
exit 2
fi
Preamble of the output file.
echo "#+TITLE: $title" >> $outputfile
echo "#+DATE: $(eval date)" >> $outputfile
echo "#+AUTHOR: $(eval whoami)" >> $outputfile
echo "#+MACHINE: $(eval hostname)" >> $outputfile
echo "#+FILENAME: $(eval basename $outputfile)" >> $outputfile
echo " " >> $outputfile
Collecting metadata.
echo "* MACHINE INFO" >> $outputfile
echo "** PEOPLE LOGGED WHEN EXPERIMENT STARTED" >> $outputfile
who >> $outputfile
echo "** ENVIRONMENT VARIABLES" >> $outputfile
env >> $outputfile
echo "** HOSTNAME" >> $outputfile
hostname >> $outputfile
if [[ -n $(command -v lstopo) ]];
then
echo "** MEMORY HIERARCHY" >> $outputfile
lstopo --of console >> $outputfile
fi
if [ -f /proc/cpuinfo ];
then
echo "** CPU INFO" >> $outputfile
cat /proc/cpuinfo >> $outputfile
fi
if [ -f /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor ];
then
echo "** CPU GOVERNOR" >> $outputfile
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor >> $outputfile
fi
if [ -f /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq ];
then
echo "** CPU FREQUENCY" >> $outputfile
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq >> $outputfile
fi
if [[ -n $(command -v nvidia-smi) ]];
then
echo "** GPU INFO FROM NVIDIA-SMI" >> $outputfile
nvidia-smi -q >> $outputfile
fi
if [ -f /proc/version ];
then
echo "** LINUX AND GCC VERSIONS" >> $outputfile
cat /proc/version >> $outputfile
fi
if [[ -n $(command -v module) ]];
then
echo "** MODULES" >> $outputfile
module list 2>> $outputfile
fi
Collecting revisions info.
echo "* CODE REVISIONS" >> $outputfile
git_exists=$(git rev-parse --is-inside-work-tree 2> /dev/null)
if [ "${git_exists}" ]
then
echo "** GIT REVISION OF REPOSITORY" >> $outputfile
git log -1 >> $outputfile
fi
svn_exists=$(svn info . 2> /dev/null)
if [ -n "${svn_exists}" ]
then
echo "** SVN REVISION OF REPOSITORY" >> $outputfile
svn info >> $outputfile
fi
This script is a simple wrapper around git revert. It detects all source code commits inside the xp# branch and reverts them automatically. Thus, try to always keep separate commits for adding data files and for changing source code.
Bash preamble.
#!/bin/bash
# Script for reverting all source modifications before merging an "exp" branch into the main "data" branch
# It shall always be called from the "exp" branch
set -e # fail fast
Doing revert only for experimental branches.
whole_list="list.out"
rm -f $whole_list
branch=$(basename "$(git symbolic-ref HEAD)")
echo "Now you are in git branch: ${branch}"
if [[ "$branch" == master || "$branch" == data ]]; then
echo "ERROR: Revert should not be performed for master or data branch!"
exit 2
fi
Making sure everything is committed.
if git diff-index --quiet HEAD --; then
echo "Everything is commited"
else
echo "Warning: It is better to commit everything before doing revert!"
fi
Finding commits that need reverting.
all_commits=$(git log --format=format:%H master..HEAD)
while IFS= read -r commit
do
type=unknown
f=$(git diff-tree --no-commit-id --name-only $commit)
while IFS= read -r line
do
case "$type,$line" in
"unknown,data") type=Data
;;
"unknown,"*) type=Src
;;
"Src,data") type=error
echo "There is a commit with Src and Data together"
exit 2
;;
"Src,"*)
;;
"Data,"*) type=error
echo "There is a commit with Src and Data together"
exit 3
;;
*) type=internal_error
;;
esac
done <<< "$f"
echo -e "$type $commit" >> $whole_list
done <<< "$all_commits"
Showing all commits.
echo "All commits and their type:"
cat $whole_list
Reverting Src commits.
revert_list=$(grep "^Src" $whole_list | cut -d' ' -f 2)
while IFS= read -r commit
do
if [ -n "$commit" ]; then # Skip the empty line when there is nothing to revert
git revert -n $commit
fi
done <<< "$revert_list"
echo "Revert before merging with data branch"
Committing the revert, creating one big “anti-commit”.
git commit -am "Revert before merging with data branch-done by git_revert.sh"
echo "DONE: Single anti-commit!"
Cleaning up.
rm -f $whole_list
The output file, containing array sizes between the two input arguments, will later be used by run_xp. Note that the seed of the random generator, the number of array sizes to test and the number of repetitions are constants of the R script, but they could also be made input parameters.
Constants of the script.
set.seed(42)
num=4
rep=5
# Disabling scientific notation in printing
options(scipen=999)
Reading the input arguments that represent the range.
args <- commandArgs(trailingOnly = TRUE)
range1<-as.numeric(args[1])
if (range1==-1)
range1<-1000000
range2<-as.numeric(args[2])
Sampling random values from the range, adding repetitions and then shuffling the vector.
if (range2==-1){
x <- rep.int(range1, rep)
} else{
x <- c(range1, range2)
x <- append(x, sample(range1:range2, num, replace=F))
x <- rep.int(x,rep)
x <- x[sample.int(length(x))]
}
cat(x, file="input_values")
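For instance, when called by run_xp with the range 1000000 to 2000000, the generator writes (2 + num) × rep = 30 shuffled sizes into the input_values file:

# Two-argument form: both bounds plus num=4 sampled sizes, each repeated rep=5 times
Rscript input_generator.R 1000000 2000000
cat input_values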
Ideally, the order in which different algorithm implementations are executed should be randomized as well, but at this point I don’t wish to change the source code file.
I copied the code from Arnaud Legrand, who used it in his performance evaluation classes for Master 2 students.
The code is quite simple at the moment and can be run in the following way.
./src/parallelQuicksort [1000000]
When run, the program initializes an array of the size given as argument (1000000 by default) with random integer values and sorts it using:
- a custom sequential implementation;
- a custom parallel implementation;
- the libc qsort function.
Times are reported in seconds.
- Typical Makefile (also taken from Arnaud Legrand) that compiles the C code
Org-mode analysis file using shell and Perl to extract data from the output files, and then R to analyze the data and generate plots with ggplot2.
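As a hedged sketch of that extraction step, assuming the program prints lines like “Sequential quicksort took: 0.231 sec.” (the output format and file path are assumptions), the timings could be pulled out of a result file with:

# Print the implementation name and its measured time (output format assumed)
grep "took:" data/xp1/QuicksortData0.org | awk '{print $1, $(NF-1)}'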
- All the source code, the scripts for running experiments, the analysis code and this labbook with only the “Documentation” part
- Everything that is already in the master branch
- All the data related to a specific experimental campaign
- Also some important (though not all) analysis .pdf report files
- Everything that is already in the master branch
- All the data gathered from previous experimental campaigns and some analyses comparing different experimental sets of results
For running test experiments, just to validate that the script is working as planned (without caring about the actual experimental results).
./run_xp.sh -t -c -v
For executing a real experiment, with an input parameter of 2000000.
./run_xp.sh -c -v 2000000
- Contains important coarse-grain information about the changes to the project.
- This section may be redundant if:
  - The project is relatively small and changes are captured well enough as Git commits
  - Researchers keep information about their progress in private journals, which they use on a daily basis for all their projects
- However, if the project is large and involves multiple collaborators, it is better to use this ChangeLog for noting all major changes
- Started writing this small example project
When the experimental campaign is over, its Git branch can be deleted, as all the data has been merged into the data branch. Still, it can be interesting to leave a trace of that code version that can be spotted more easily than a particular commit, in order to ease a future entry point for reproducing the experiments. For that, we put Git tags on some important branches.
Git tags can also be used as in the standard software development to annotate an important source code version.
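For instance, an annotated tag can be created and shared as follows (the tag name and message are illustrative):

# Tag the current commit and push the tag to the remote
git tag -a stable1.0 -m "Stable version after the xp3 campaign"
git push origin stable1.0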
The list of Git tags can be copied by hand or, even better, generated using Org-babel.
git tag -n1
stable0.9       This code works correctly
xp1             First experiments to test the workflow
xp2             Experiments to try analysis on multiple data
xp3             Comparing 3 algorithm implementations for a range of array sizes