\documentclass[a4paper, 11pt, twoside]{article}
\usepackage[hcentering,bindingoffset=10mm,margin=20mm]{geometry}
\usepackage{graphicx}
\graphicspath{{figures/}} % Location of the graphics files
\usepackage{makeidx}
% \graphicspath{/figures} % Location of the graphics files
\usepackage{multirow}
% Making R code work!
\usepackage{listings}
\usepackage{color}
\usepackage{hyperref}
\usepackage{booktabs}
\usepackage{float} % allows forcing of float position
% \restylefloat{table}
\usepackage{bm}
\usepackage[parfill]{parskip}
% begins paragraphs with an empty line rather than an indent
% Add toc to contents
\usepackage[nottoc,numbib]{tocbibind}
\hypersetup{urlcolor=blue, colorlinks=false, hypertexnames=true} % Colours hyperlinks in blue, but this can be distracting
\usepackage[capitalise]{cleveref}
\definecolor{dkgreen}{rgb}{0,0.6,0}
\definecolor{gray}{rgb}{0.5,0.5,0.5}
\definecolor{mauve}{rgb}{0.58,0,0.82}
\crefname{lstlisting}{listing}{listings}
\Crefname{lstlisting}{Listing}{Listings}
\lstset{ %
language=R, % the language of the code
basicstyle=\normalsize\ttfamily, % the size of the fonts that are used for the code
% numbers=left, % where to put the line-numbers
% numberstyle=\tiny\color{gray}, % the style that is used for the line-numbers
% stepnumber=2, % the step between two line-numbers. If it's 1, each line
% will be numbered
% numbersep=5pt, % how far the line-numbers are from the code
% backgroundcolor=\color{white}, % choose the background color. You must add \usepackage{color}
% showspaces=false, % show spaces adding particular underscores
% showstringspaces=false, % underline spaces within strings
% showtabs=false, % show tabs within strings adding particular underscores
frame=false, % adds a frame around the code
rulecolor=\color{white}, % if not set, the frame-color may be changed on line-breaks within not-black text (e.g. comments (green here))
% tabsize=2, % sets default tabsize to 2 spaces
% captionpos=b, % sets the caption-position to bottom
% breaklines=true, % sets automatic line breaking
% breakatwhitespace=false, % sets if automatic breaks should only happen at whitespace
% title=\lstname, % show the filename of files included with \lstinputlisting;
% also try caption instead of title
keywordstyle=\color{blue}, % keyword style
commentstyle=\color{dkgreen}, % comment style
stringstyle=\color{mauve}, % string literal style
escapeinside={\%*}{*)}, % if you want to add a comment within your code
morekeywords={*,...} % if you want to add more keywords to the set
}
% Include any extra LaTeX packages required
\usepackage[round]{natbib} % Use the "Natbib" style for the references
\usepackage{verbatim} % Needed for the "comment" environment to make LaTeX comments
\usepackage{wallpaper}
\usepackage{cases}
\usepackage{wrapfig}
\makeindex
% \renewcommand{\includegraphics}[2][]{\fbox{#2}} %omits images
\begin{document}
\title{Introducing spatial microsimulation with R: a practical}
% \author{Robin Lovelace --- R.Lovelace at. Leeds. ac. uk}
\pagestyle{myheadings}
\author{Lovelace, Robin\\
\texttt{[email protected]}}
\maketitle
\tableofcontents
\newpage
\section{Foundations}
This practical teaches the basic theory and practice of `spatial microsimulation'
using the popular free software package R. The term microsimulation means
different things in different disciplines, so it is important to be clear
at the outset what we will and will not be covering.
\emph{We will} be learning how to create \emph{spatial microdata}, the
basis of all spatial microsimulation models, using \emph{iterative
proportional fitting} (IPF). IPF is
an efficient method for allocating individuals from a
non-spatial dataset to geographical zones, analogous to the
`Furness method' in transport modelling, but with more constraints.
There are other ways of generating
spatial microdata but, as far as the author is
aware,\footnote{There is much need
to compare different methods of generating spatial microdata, for example by
building on the work of \citet{harland2012}. A project for doing
these tests in a transparent way has been started at
\href{https://github.com/Robinlovelace/IPF-performance-testing}{github.com/Robinlovelace/IPF-performance-testing}
and there are probably many opportunities for improving the computational efficiency
of the code presented in this course --- please get in touch if you would
like to contribute.}
this is the most effective and flexible for many applications.
An alternative approach using the open source `Flexible Modelling Framework'
program is described in detail, with worked examples, by \citet{harland2013microsimulation}.
\emph{We will not} be learning `dynamic spatial microsimulation'
\citep{Ballas2005b}: once the
spatial microdata have been generated and \emph{integerised}, it is up to the
user how they are used --- be it in an agent-based model or as a basis for
estimates of income distributions at the local level.
We thus define spatial microsimulation narrowly in this tutorial as the process
of generating \emph{spatial microdata} (more on this below). The term
can also be used to describe a wider approach that harnesses individual-level data
allocated to zones for investigating phenomena that vary over space and between
individuals, such as income inequality or energy overconsumption. In both cases, the generation
of spatial microdata is the critical element of the modelling process so the skills
learned in this tutorial will provide a firm foundation for further work.
One of the tricky things about spatial microsimulation for newcomers is its use
of specialist language. It is important to know exactly what is meant by
`spatial microdata' and other technical terms. To this end we have created a
glossary that provides succinct definitions (see \cref{gloss}). Any term that is
\emph{italicised} in the text has a glossary entry.
\begin{wrapfigure}{R}{11cm}
\begin{center}
\includegraphics[width=10.5cm]{rstudio}
\end{center}
\caption{The RStudio user interface. Note the project title `smsim-course' in
the top right and the `Files' tab at the top of the left hand window.}
\label{frstudio}
\end{wrapfigure}
\subsection{Prerequisites for this practical}
This practical uses the statistical analysis and modelling software
R. We suggest you install and take a look at this powerful program
before getting started. We recommend using R within RStudio (\cref{frstudio}), which makes
using R much easier. Instructions to install both R and RStudio can be found
at \href{http://www.rstudio.com/ide/download/desktop}{rstudio.com}.
% Could add screenshot of Dl ZIP here!!!
The other prerequisite for the course is downloading the example data.
These can be downloaded in a single zip file which can be found on the
course's GitHub repository:\footnote{GitHub
is used to serve this .zip file because if anything changes
in the tutorial, the .zip file from the same location will
automatically be updated. The other advantages of GitHub are its
visibility --- a pdf of this tutorial, for example,
is kept up-to-date at
\href{https://github.com/Robinlovelace/smsim-course/blob/master/handout.pdf}{github.com/Robinlovelace/smsim-course}
in the `handout.pdf' file --- accessibility and encouragement of evolution: to contribute
to this project all one needs to do is
`\href{https://help.github.com/articles/fork-a-repo}{fork}' it.
} \\
\href{https://github.com/Robinlovelace/smsim-course}{github.com/Robinlovelace/smsim-course}.
Click on the ``Download ZIP'' button to the right of this page and extract
the folder into your desktop or other suitable place on your computer.
Once the folder has been successfully extracted, open it in
your file browser and take a look around. You will find a number of files
and two sub-folders
entitled `data' and `figures'. When you get to the
sections that use R code, it is useful for R to operate from
within the smsim-course-master folder. Probably the best way
to set this up is to open the file `smsim-course.Rproj' from
within RStudio (\cref{frstudio}).
Try this now and click on the `Files' tab in
the bottom right hand window of RStudio.
Before using the power of R in RStudio it's worth
understanding a bit about `IPF', the algorithm we will use to generate
the synthetic population or `spatial microdata' (\cref{fmsim-schema}).
\begin{figure}[H]
\begin{center}
\includegraphics[width=7cm]{msim-schema}
\end{center}
\caption{Schema of iterative proportional fitting (IPF) and combinatorial
optimisation
in the wider context of the availability of different data formats and spatial
microsimulation. \label{fmsim-schema}}
\end{figure}
\subsection{Learning by doing}
As \citet[xxii]{kabacoff2011r} put it regarding R, ``the best
way to learn is to experiment'' and the same applies to spatial microsimulation.
We believe you will learn the technique/art best not by reading about it,
but by doing it. Mistakes are inevitable in any challenging task and should not
be discouraged. In fact it is by making blunders, identifying and then correcting them
that many people learn best. Think of someone learning to skate: no one ever
picks up a skateboard for the first time being able to `surf the sidewalks'. It
takes time, patience and plenty of falls before you master the art.
The same applies to spatial microsimulation.
Spatial microsimulation works by taking \emph{microdata} at the individual level
and using aggregate-level constraints to allocate these individuals to zones.
The two main methods are \emph{deterministic reweighting} and \emph{combinatorial optimisation}.
This practical takes the former approach using a process called \emph{iterative proportional fitting}
(IPF). IPF is used to increase the weights of individuals who are representative
of the target area and reduce the weights of individuals who are relatively
rare \citep{Lovelace2013-trs}.
The output is a \emph{spatial microdataset}.
A schematic of the process is shown in \cref{fmsim-schema}.
Take a look at the image and think about the process. But we don't want to get
bogged down in theory or applications in this course: we want to `get our
hands dirty'. Next we will do just that, by applying the process to some
example data.
\subsection{Input data} \label{s:theory}
Let's start with two very basic datasets. To aid understanding, we will primarily do
the reweighting by hand to understand the process before automating it on the computer.
As outlined in Figure~\ref{fmsim-schema},
there are two input files required to perform spatial microsimulation:
\begin{itemize}
\item Survey data --- information about individuals
\item Geographically aggregated (`wide') data --- typically aggregated Census tables
\end{itemize}
Table \ref{t:w} shows 5 individuals, who are defined by two constraint variables:
age and sex. This is analogous to the survey data shown in Figure~\ref{fmsim-schema}.
Table \ref{t:m} presents the same data in aggregated form; its marginal totals
are used in the IPF procedure described shortly. Note that individual 3 is highlighted in
blue and bold for future reference.
\begin{table}[h]
% {r}{10cm}
\caption{Hypothetical input microdata (`survey data'),
with the original
weights set to one. The bold and blue value is used subsequently for
illustrative purposes.
}
\label{t:w}
\begin{center}
\begin{tabular}{llll}
\toprule
{Individual } & {Age} & {Sex} & {Weight} \\
\midrule
1 & 59 & Male & 1 \\
2 & 54 & Male & 1 \\
3 & {35} & {Male} & \textbf{\color{blue}1} \\
4 & 73 & Female & 1 \\
5 & 49 & Female & 1 \\
% 1 & 59 & m \\
% 2 & 54 & m \\
% 3 & 35 & m \\
% 4 & 73 & f \\
% 5 & 49 & f \\
\bottomrule
\end{tabular}
\end{center}
\end{table}
\begin{table}[htbp]
\centering
\caption[The aggregated results of the weighted
microdata set]{The aggregated results of the weighted
microdata set ($m$ --- think ``m'' for Modified or siMulated).
Note, these values are updated by each new set of weights.
% so change after each iteration.
% hence the number ``(1)''
% --- this table is for the initial weight.
}
\begin{tabular}{cllll}\toprule
Marginal totals& & \multicolumn{2}{c}{$j$} & \\
& Age/sex & Male & Female & T \\ \midrule
\multirow{2}{*}{$i$} & Under-50 & \textbf{\color{blue}1} & 1 & \textbf{\color{blue}2}\\
& Over-50 & 2 & 1 &3 \\
& T & 3 & 2 &5\\
\bottomrule
\end{tabular}
\label{t:m}
\end{table}
Table \ref{t:s} contains data for a single area. This is analogous to the
geographically aggregated `wide' data shown in Figure~\ref{fmsim-schema}.
\begin{table}[H]
\centering{
\caption{Small area constraints (geographically aggregated or `wide' input data).
Note that although individual 3 fits into both ``under 50'' and ``male'' categories,
only the value of the former is highlighted as age is the first constraint. \label{t:s}}
\begin{tabular}{cllll}
\toprule
Constraint $\Rightarrow$ & \multicolumn{2}{c}{$i$}& \multicolumn{2}{c}{$j$}\\
Category $\Rightarrow$ & $i_1$ & $i_2$ & $j_1$ & $j_2$ \\
Area $\Downarrow$ & Under-50 & Over-50 & Male & Female\\
1 & \textbf{\color{blue}8} & 4 & 6 & 6\\
\bottomrule
\end{tabular}}
\end{table}
The astute reader will recognise that the data in
Tables \ref{t:m} and \ref{t:s} are incompatible: they are
two matrices with different dimensions ($3 \times 3$ and $1 \times 4$ respectively).
To solve this problem we can rearrange the small area census constraints data
($s$, presented in Table \ref{t:s}) to match the dimensions of
$m(1)$ (table \ref{t:m}).
\Cref{t:s2} illustrates $s$ in a different form.
Note that although we are ignorant of the internal values of this matrix,
it is of the same dimension as $m(1)$ and the marginal totals (denoted ``T'' for
total) can be directly compared. This is the basis of the method and provides a
way to systematically compare individual level data with geographic aggregates.
\begin{table}[h]
\centering
\caption[Small area constraints expressed as marginal totals]{Small
area constraints (matrix $s$) re-arranged to be the same dimension as $m$ (Table \ref{t:m}).
``s'' here stands for Small area data or conStraint. Note only
the marginal totals are known. The internal cell
values (represented by ``?'') must be estimated by spatial microsimulation.
}
\begin{tabular}{cllll}\toprule
Marginal totals& & \multicolumn{2}{c}{$j$} & \\
& Age/sex & Male & Female & T\\ \midrule
\multirow{2}{*}{$i$} & Under-50 & \textbf{?} & ? & \textbf{\color{blue}8}\\
& Over-50 & ? & ? &4 \\
& T & 6 & 6 &12\\
\bottomrule
\end{tabular}
\label{t:s2}
\end{table}
With our input data in place, we can begin our first iteration.
\subsection{The IPF equation} \label{ipfeq}
% \begin{wrapfigure}{R}{9cm}
% \includegraphics[width=8cm]{equation1}
% \caption{\Cref{eq:ipf} in word form.}
% \label{feq}
% \end{wrapfigure}
Using the tables $m$ and $s$ presented above we readjust the weights of the
individuals so that:
\begin{itemize}
% \item If there are exactly the same number of
\item Individuals who are rare in the geographic zone of interest are given a smaller weight (we need fewer of them).
\item Individuals who are common in the zone of interest are given a larger weight (we need more of them).
\item The weights sum to the total population of the geography of interest (12 in our example).
\end{itemize}
For each constraint, the current weight (Table~\ref{t:w}) is multiplied by the respective
total from the geographically aggregated (census) table (Table~\ref{t:s}) and
divided by the corresponding marginal total of the survey data (see \cref{t:m}).
For example, using the age constraint the new weight of individual 3 is calculated
based on the following numbers: the initial weight is set as 1, the number of
individuals under 50 in the geographically aggregated (Census) data is 8 (see Table \ref{t:s})
and the number
of individuals under 50 in the survey data is 2 (see Table~\ref{t:m}). Thus the new weight is
$1 \times \frac{8}{2}$. These figures
are highlighted in \textbf{\color{blue}bold and blue} in the preceding tables.
The same process is used to calculate the new weights for all individuals --- remember
we are only constraining by age for now.
% Test your understanding by calculating
% the value of the ? symbols in table~\ref{t:new-weights-part-complete} below.
%
% \begin{table}[htbp]
% \centering
% \caption{Partially complete reweighted microdata.}
% \begin{tabular}{lllll}
% \toprule
% {Individual} & {Sex} & {age-group} & {Weight} &
% {New weight, $w(2)$} \\ \midrule
% 1 & Male & Over-50 & 1 & ? \\
% 2 & Male & Over-50 & 1 & ? \\
% 3 & Male & Under-50 & 1 & $1 \times \frac{\color{blue}8}{\color{blue}2} = 4$ \\
% 4 & Female & Over-50 & 1 & ? \\
% 5 & Female & Under-50 & 1 & ? \\
% \bottomrule
% \end{tabular}
% \label{t:new-weights-part-complete}
% \end{table}
%
This process of \emph{reweighting} is done one constraint at a time and can
be described more succinctly in the form of a word equation
\cref{eq:ipf-word}, and more generally by \cref{eq:ipf} for constraint $i$ (age):
\begin{equation}
Weight_{New} = Weight_{Current} \times \frac{Census_{Constraint \: Total}}{Survey_{Constraint \: Total}}
\label{eq:ipf-word}
\end{equation}
\begin{equation}
w(n+1)_{ij} = \frac{w(n)_{ij} \times sT_{i}}{mT(n)_{i}}
\label{eq:ipf}
\end{equation}
where $w(n+1)_{ij}$ is the new weight for individuals with characteristics $i$
(age, in this case) and $j$ (sex), $w(n)_{ij}$ is the original
weight for individuals with these characteristics, $sT_{i}$ is the
marginal total of category $i$ in the small area constraint table, $s$
(Table \ref{t:s}), and $mT(n)_{i}$ is the marginal total of category
$i$ of the aggregated results of the weighted
microdata, $m$ (Table \ref{t:m}).
$n$ represents the constraint number.
Do not worry about completely understanding the above procedure for now; its
meaning will probably take time to sink in.
More important is implementing it.
Follow the blue emboldened values in Tables \ref{t:w} to \ref{t:s2} to fill in the
question marks in \cref{t:new-weights} and calculate the new weight
of individual 3, a male under 50 years of age.
To validate your calculation, the sum of weights ($W_1$ to $W_5$)
should equal the population of the zone (12, in our example). This is a
useful check.
\begin{table}[htbp]
\centering
\caption{Reweighting the hypothetical microdataset in order to fit
Table \ref{t:s}.}
\begin{tabular}{lllll}
\toprule
{Individual} & {Sex} & {age-group} & {Weight} &
{New weight, w(2)} \\ \midrule
1 & Male & Over-50 & 1 & $1 \times 4/3 = \frac{4}{3}$ \\
2 & Male & Over-50 & 1 & $1 \times 4/3 = \frac{4}{3}$ \\
3 & Male & Under-50 & 1 & $1 \times \frac{\textbf{\color{blue}8}}{\textbf{\color{blue}?}} = \textbf{\color{blue}?}$ \\
4 & Female & Over-50 & 1 & $1 \times 4/3 = \frac{4}{3}$ \\
5 & Female & Under-50 & 1 & $1 \times \frac{8}{2} = 4$ \\
\midrule
Total & & & 5 & 12 \\
\bottomrule
\end{tabular}
\label{t:new-weights}
\end{table}
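To verify your answer for individual 3 (and the completed table as a whole),
the new weights should sum to the zone's population of 12:
\begin{equation*}
\frac{4}{3} + \frac{4}{3} + \textbf{\color{blue}4} + \frac{4}{3} + 4 \; = \; \frac{12}{3} + 8 \; = \; 12
\end{equation*}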
\subsection{Re-aggregation after the first constraint} \label{sreag}
After creating $Weight_{New}$ for all categories of the first constraint (under-50 and 50$+$ for age),
the individual level data are re-aggregated before new weights can be calculated from the next constraint.
Re-aggregating the individual-level data --- to compare the marginal totals with the constraint tables
ready for the second constraint --- will result in
\cref{t:m2}. To generate this table, the new weights ($w(2)$,
presented in \cref{t:new-weights}) are multiplied by the
number of individuals in each category. Thus, to calculate the new
estimate for the number of males we multiply the weight of
males under 50 (4 --- follow the emboldened values above) by the
number of individuals in this category (1) and add the corresponding values
for males over 50 ($\frac{4}{3}$ and 2). This is as follows
(see the bold value in the bottom left of \cref{t:m2}):
\begin{equation}
\sum_{i = male} m(2) \; = \; \textbf{\color{blue} 4} \times \textbf{\color{blue}1} + \frac{4}{3} \times 2 \; = \;
\textbf{\color{blue}4} + \frac{8}{3} \; = \textbf{\color{blue}6} \frac{\textbf{\color{blue}2}}{\textbf{\color{blue}3}}
\end{equation}
After performing this process of re-aggregation (\cref{t:m2}) for all categories
in the \emph{age} constraint,
the next stage is to repeat \cref{eq:ipf} for the \emph{sex} constraint to generate a
third set of weights, by replacing
the $i$ in $sT_{i}$ and $mT(n)_{i}$ with $j$ and incrementing the value of $n$:
\begin{equation}
w(3)_{ij} = \frac{w(2)_{ij} \times sT_{j}}{mT(2)_{j}}
\label{eq:ipf2}
\end{equation}
\subsection{Test your understanding: the second constraint}
To test your understanding of IPF, try to apply \cref{eq:ipf2} to the
information above
and that presented in \cref{t:m2}.
This should result in the following vector of new weights, for individuals 1 to
5. Calculate the correct values and pencil them in in place of the question
marks. % Add online link to answers - tinyurl
\begin{equation}
w(3) = (\frac{6 \times \frac{4}{3}}{6 \frac{2}{3}},\;\;\; \frac{6 \times
\frac{\;4\;}{\;3\;}}{\;?\;},\;\;\; \boldsymbol{ \frac{\color{blue}6 \times 4}{\color{blue} 6\frac{2}{3}
}}, \;\;\;\frac{\;?\; \times \;?\;}{?}, \;\;\;\frac{4 \times 6}{5 \frac{1}{3}})
% w(3) = (\frac{6}{5}, \frac{6}{5}, \frac{18}{5}, \frac{3}{2}, \frac{9}{2})
\end{equation}
After simplifying these fractions, the results are as follows.
One `sanity' check of your method here is whether the sum of these
weights is still equal to the area's total population of twelve. Test that this is
true:
\begin{equation}
% w(3) = (\frac{6}{5}, \frac{6}{?}, \boldsymbol{ \frac{4 \times 6}{ 6\frac{2}{3}
% }}, \frac{?}{?}, \frac{9}{2})
w(3) = (\frac{6}{5}, \frac{6}{5}, \boldsymbol{\frac{\color{blue}18}{\color{blue}5}}, \frac{3}{2},
\frac{9}{2})
\label{finalout}
\end{equation}
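Carrying out the check:
\begin{equation*}
\frac{6}{5} + \frac{6}{5} + \frac{18}{5} + \frac{3}{2} + \frac{9}{2} \; = \; \frac{30}{5} + \frac{12}{2} \; = \; 6 + 6 \; = \; 12
\end{equation*}
so the zone's total population is indeed preserved.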
What do the weights in w(3) actually mean? They indicate how representative
each individual is of the target zone after one iteration of IPF, constraining
by age and sex. Individual number 5 has the highest weight because there is
only one young female in the survey dataset yet seemingly many in the area in
question.
Notice also that after each iteration the fit between the marginal
totals of $m$ and $s$
improves.\footnote{This can be checked by comparing the aggregated weighted
individuals with the small area constraints. Total absolute error (TAE),
defined as the sum of all absolute differences between simulated and observed marginal
totals, falls from 14 for $m(1)$ to
1.3 for $m(2)$ (compare \cref{t:m} and \cref{t:m2} with \cref{t:s2} above).
TAE for $m(3)$ (not shown,
but calculated by aggregating $w(3)$) improves even more, to 0.2.
This number converges towards 0 through subsequent
iterations, a defining feature of IPF.}
\begin{table}[htbp]
\centering
\caption[Aggregated results after constraining for age]{The
aggregated results of the weighted
microdata set after constraining for age ($m(2)$).
}
\begin{tabular}{cllll}\toprule
Marginal totals& & \multicolumn{2}{c}{$j$} & \\
& Age/sex & Male & Female & T\\ \midrule
\multirow{2}{*}{$i$} & Under-50 & \textbf{\color{blue}4} & 4 & 8\\
& Over-50 & $\frac{8}{3}$ & $\frac{4}{3}$ & 4 \\
& T & $\boldsymbol{{\color{blue}6}\frac{\color{blue}2}{\color{blue}3}}$ & 5$\frac{1}{3}$ & 12\\
\bottomrule
\end{tabular}
\label{t:m2}
\end{table}
% A key benefit from a policy perspective is that
% IPF and other spatial microsimultion techniques
% can provide estimation of variables whose values are not
% known at the local level (e.g. income).
\subsection{IPF in a spreadsheet}
% Computer code is absent from most people's daily lives so the
% R code we use to automate IPF will likely seem like a foreign language
% to many. Spreadsheet programs like Microsoft Excel and LibrOffice Calc are
% comparatively well known, though.
R is an unfamiliar programme to many, so before we move on to producing an R script that can calculate weights, we will carry out an intermediary step in a spreadsheet programme.\footnote{This
example was produced by Phil Mike Jones --- see \href{http://www.philmikejones.net/}{philmikejones.net}.}
For small examples like the one we are using, a spreadsheet has two main advantages: it is already familiar to many people, so it can act as a bridge to calculating IPF on a computer; and it is easier to see and recall more than one table in a spreadsheet than in R with RStudio.
Open `sms-spreadsheet-exercise.xlsx' if you are using Microsoft Excel or `sms-spreadsheet-exercise.ods' (note the different file types) if you are using LibreOffice/OpenOffice Calc and follow the tasks outlined in the file. The steps are exactly the same as in the pen and paper example above. This allows you to check your answers more easily, helps with the transition to performing the calculations of IPF in a computer programme, and aids recall of the procedure through repetition.
\section{IPF in R} \label{simplementing}
% priority: automate!!!
So far we have implemented IPF by hand and in a spreadsheet.
This section explains how the IPF
\emph{algorithm} described above is implemented in R, using the
same input data. Now we will be working within the RStudio
environment with the `smsim-course' project
loaded and the `simple.R' file open in the top left panel
(\cref{frsimple}).
\begin{figure}
\begin{center}
\includegraphics[width=15cm]{rs-simple}
\end{center}
\caption{The RStudio user interface with the `simple.R' file loaded in
the top left window and some output displayed in the bottom left window.}
\label{frsimple}
\end{figure}
Note a couple of things from this figure:
there are many explanations of the code and these are commented out
by the \texttt{\#} (`hash') symbol. Also note that in the bottom left panel
R displays the output of the commands we send it from the script.
Hitting the \texttt{Ctrl} and \texttt{Enter} keys simultaneously from
within the upper (script) window will cause R to run these commands,
line by line. In the code blocks below, any outputs from R are preceded
by `\texttt{\#\#}'. Comparing these outputs with the output on your screen will
ensure that R is behaving as it should.
We will spend the rest of this section working through
the commands contained in this file.\footnote{A longer
tutorial is available from Rpubs, a site dedicated
to publishing R analyses that are reproducible:
\href{http://rpubs.com/RobinLovelace/5089}{rpubs.com/RobinLovelace/5089}.}
% This section is based on ``Spatial microsimulation in R: a
% beginner’s guide to iterative proportional fitting (IPF)'', a tutorial
% written to accompany a methods paper on integerisation
The code snippets are taken from the `simple.R' script file (just a plain text
document, like all R scripts) so you can copy and paste. However, it is
highly recommended not to look at this file until you need to, towards
the end of this
section. This is learning by typing, as advocated by \citet{shaw2013learn}.
\subsection{Loading and exploring the input data}
The data presented in the above worked example are saved in the
`simple' sub-folder of `data' as .csv files. To load them in R,
we use the following commands:
\begin{lstlisting}[float=h, caption={Loading the input data in R}, label=cusd]
ind <- read.csv("data/simple/ind.csv") # load the individual level data
cons <- read.csv("data/simple/cons.csv") # load aggregate constraints
\end{lstlisting}
To check that the \emph{data frames} have loaded correctly, simply
ask R to print their contents by entering the object's name. To display the
individual level data, for example, simply type \verb/ind/:\footnote{An alternative
solution is to click the object's name in `Environment' in
RStudio's top right window.}
\begin{lstlisting}[float=h, caption={Checking the contents of the individual
level data frame}, label=cout]
ind # Show the data frame in R
##   id age sex
## 1  1  59   m
## 2  2  54   m
## 3  3  35   m
## 4  4  73   f
## 5  5  49   f
\end{lstlisting}
Note that the input data is identical to the tables illustrated above for
consistency. Instead of constraining for just one zone as we did in the example
above, we will fit the data to six zones here. To subset a dataset, add square
brackets to the end of the object's name in R. Entering \verb cons[1:2, ] ,
for example, should output the first two rows of the constraint variable.
Within the square brackets the number before the comma refers to rows (blank
means all rows) and the numbers after the comma refer to columns. If the output
of this command looks like the text below, congratulations are in order:
you have successfully loaded a subset of the constraint dataset.
\begin{lstlisting}[float=h, caption={Printing a subset of the constraint data
in R},
label=cou]
cons[1:2, ]
## X16.49 X50. m f
## 1 8 4 6 6
## 2 2 8 4 6
\end{lstlisting}
Note that the top row is identical to \cref{t:s} --- we can therefore compare
the results of doing IPF on the computer with the results obtained by hand.
% !!! Manual input of data for book!!!
% Usually the input data for spatial microsimulation are loaded from
% .csv files, one for each
% constraint and one for the input microdata. These are read-in with the command
% \verb read.csv . For the purposes of understanding how the model works,
% the dataset is read line by line, following the
% example above. The following code creates example datasets,
% based on the same hypothetical survey of 5 individuals described above,
% and 5 small areas. The spatial microsimulation model will select individuals
% based on age and sex and mode of transport (mode of transport
% is also used on the larger online example described in footnote \ref{fnrpub}).
% For consistency with the (larger) model used for the paper,
% the individual level data will be referred to as USd (Understanding Society dataset)
% and the geographic data as all.msim (for all constraint variables).
% The code to read-in the individual level data are presented in code sample \ref{cusd}.
% When called, the data are then displayed as a table (see listing \ref{cout}).
% \begin{lstlisting}[float=h, caption={Manual input of individual level data
% in R}, label=cusd]
% # Read in the data in long form (normaly read.table() used)
% c.names <- c("id", "age", "sex")
% USd <- c( 1, 59, "m",
% 2, 54, "m",
% 3, 35, "m",
% 4, 73, "f",
% 5, 49, "f")
% USd <- matrix(USd, nrow = 5, byrow = T) # Long data into matrix
% USd <- data.frame(USd) # Convert this into a dataframe
% names(USd) <- c.names # Add correct column names
% USd$age <- as.numeric(levels(USd$age)[USd$age]) # Age is a numeric
% \end{lstlisting}
% \begin{lstlisting}[float=h, caption={Output of the USd data frame}, label=cout]
% USd # Show the data frame in R
% ## id age sex
% ## 1 1 59 m
% ## 2 2 54 m
% ## 3 3 35 m
% ## 4 4 73 f
% ## 5 5 49 f
% \end{lstlisting}
% The same procedure applies to the geographical data (listing \ref{cgeo}).
% \begin{lstlisting}[float=h*, caption={Geographic data input}, label=cgeo]
% category.labels <- c("16-49", "50+" # Age constraint
% ,"m", "f" # Sex constraint
% # more constraints could go here
% )
% all.msim <- c( 8, 4, 6, 6, # Original aggregate data
% 2, 8, 4, 6, # Elderly
% 7, 4, 3, 8, # Female dominated
% 5, 4, 7, 2, # Male dominated
% 7, 3, 6, 4 # Young
% )
% all.msim <- matrix(all.msim, nrow = 5, byrow = T)
% all.msim <- data.frame(all.msim) # Convert to dataframe
% names(all.msim) <- category.labels # Add correct column names
% \end{lstlisting}
% IPF relies on the assumption that all constraint variables will contain the
% same number of people. This is logical (how can there be more people classified
% by age than by sex?) but can cause problems for constraint variables that use
% only a subset of the total population, such as those who responded to questions on
% travel to work. To overcome this problem, it is possible to normalise the
% constraint variables, setting the total for each to the one that has the most
% reliable total population. This worked example simply checks whether
% or not they are (listing \ref{ccheck}).
% \begin{lstlisting}[float=h, caption={R code to check the constrain populations
% match}, label=ccheck]
% # Check totals for each constraint match
% rowSums(all.msim[,1:2]) # Age constraint
% ## [1] 12 10 11 9 10
% rowSums(all.msim[,3:4]) # Sex constraint
% ## [1] 12 10 11 9 10
%
% rowSums(all.msim[,1:2]) == rowSums(all.msim[,3:4])
% ## [1] TRUE TRUE TRUE TRUE TRUE
% \end{lstlisting}
Using the subsetting notation just introduced, we must now subset the constraint (\verb cons ) data frame so we can access each constraint (age or sex) separately. We select columns 1 to 2, and 3 to 4, leaving all rows in place. Give these subsets (constraints) the names \verb con1 and \verb con2 respectively:
\begin{lstlisting}[float=h, caption={Subsetting the constraint file}]
con1 <- cons[ ,1:2]
con2 <- cons[ ,3:4]
\end{lstlisting}
To make the individual level dataset comparable to the aggregate level we must
first categorise it and count the number in each category. This job is
performed by the R \emph{script} categorise.R, contained in the `simple' data
folder. Run it by entering
\verb source("data/simple/categorise.R") .\footnote{Note
that the
`categorise.R' script must be modified each time it is used with a different
dataset.}
This results in the creation of a
new object called \texttt{ind.cat} in R, which is a matrix of 0s and
1s with the same number of columns as the constraint data --- try
exploring \texttt{ind.cat} using the square bracket notation mentioned above.
To check that the script has worked
properly, let's count the number of individuals it contains:
\begin{lstlisting}[float=h, caption={Output from ind.cat}]
colSums(ind.cat)
## a16.49 a50. m f
## 2 3 3 2
\end{lstlisting}
The sum of both the age and sex variables is 5 (the total number of individuals): it
worked! Now that the data are in the same form as the constraint variables and we have
checked the data make sense, we can create the `weight' matrix and begin
reweighting the individuals to zones (\cref{lscreate}).
\begin{lstlisting}[float=h, caption={Creating the weight and aggregated
individual level matrices --- see lines 21 and 22 in `simple.R'}, label=lscreate]
weights <- array(1, dim=c(nrow(ind),nrow(cons)))
ind.agg <- matrix(rep(colSums(ind.cat), nrow(cons)), nrow(cons) )
\end{lstlisting}
The code in \cref{lscreate} may not make total sense yet. Do not worry about this for now; you can always
check what is going on later by consulting R's documentation (e.g. try entering
\verb ?rep ) or
by pressing the tab key when entering arguments for R commands in RStudio. For the time
being, it is best to press on with the example and understand the concepts ---
the details can be looked at later.
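Before moving on, it may help to see roughly what `categorise.R' does. The
following is a minimal sketch of the kind of recoding involved (a simplification
for the simple example, not the contents of the script itself): each individual
receives a 1 in the columns matching their age band and sex, and 0 elsewhere.
\begin{lstlisting}[float=h, caption={A minimal sketch of the recoding performed by `categorise.R' (illustrative, not the actual script)}]
ind.cat <- matrix(0, nrow = nrow(ind), ncol = ncol(cons))
colnames(ind.cat) <- c("a16.49", "a50.", "m", "f")
ind.cat[, "a16.49"] <- as.numeric(ind$age < 50)  # flag the under-50s
ind.cat[, "a50."]   <- as.numeric(ind$age >= 50) # flag the over-50s
ind.cat[, "m"]      <- as.numeric(ind$sex == "m")
ind.cat[, "f"]      <- as.numeric(ind$sex == "f")
\end{lstlisting}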
\subsection{Reweighting by the age constraint}
With all of these objects in place, we are ready to begin allocating new weights
to the individuals via IPF. The IPF formula presented in \cref{ipfeq} is written
in code below. It begins with a couple of \emph{for loops}, one to iterate through
each zone (hence the \texttt{1:nrow(cons)} argument, which means ``from
1 to the number of zones in the constraint data'') and one to iterate
through each category within constraint 1 (under 50 and over 50 in this case).
Before running this code, ensure that all previous commands in the `simple.R'
script have been performed.
\begin{lstlisting}[float=h, caption={Calculating new weights for the individuals
based on constraint 1 (age) via IPF --- see line 30 and beyond in `simple.R'}]
for (j in 1:nrow(cons)){
for(i in 1:ncol(con1)){
weights[which(ind.cat[,i] == 1),j] <- con1[j,i] / ind.agg[j,i]}}
\end{lstlisting}
The above code updates the weight matrix by dividing
the census constraint (\texttt{con1}) by the aggregated version
of the individual level data, just as in \cref{ipfeq} above.
Further explanation is provided in \cref{sipfcake}; for now let's push on
to the next block of code, which updates the aggregated data (named
\texttt{ind.agg} in R) based on the new weights.
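Before doing so, it is worth tying the code back to the hand-worked example.
Printing the first column of the weight matrix should reproduce the $w(2)$
values from \cref{t:new-weights} (the output below is what you should expect,
assuming the preceding code has been run):
\begin{lstlisting}[float=h, caption={Checking the new weights for zone 1 against \cref{t:new-weights}}]
weights[, 1] # weights for zone 1 after the age constraint
## [1] 1.333333 1.333333 4.000000 1.333333 4.000000
\end{lstlisting}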
\subsection{Re-aggregation}
\begin{lstlisting}[float=h, caption={Re-aggregation of the individual
level data based on the new weights (see lines 35 and 36 in `simple.R')}]
for (i in 1:nrow(cons)){ # convert con1 weights back into aggregates
ind.agg[i,] <- colSums(ind.cat * weights[,i])}
\end{lstlisting}
The above code multiplies the new weights by the
flat (0 and 1) version of the individual-level data, one zone (\texttt{i}) at a
time. This is equivalent to the re-aggregation stage described in
\cref{sreag} and the result can be checked by printing the first
row of our new \texttt{ind.agg} object:
\begin{lstlisting}[float=h, caption={The new aggregate (marginal) totals for each category
after weights have been constrained by age. Compare with \cref{t:m2} above.}]
ind.agg[1, ]
## [1] 8.000000 4.000000 6.666667 5.333333
\end{lstlisting}
To recap, we have updated `ind.agg' with new values
that are constrained by the age variable. This seems to improve the fit
between the census and individual-level data as the marginal totals for
age are now equal to the aggregate values for age of zone 1.
This is further
verified in the `simple.R'
script file with a line of code that calculates the Total Absolute Error
after each iteration --- \verb sum(abs(ind.agg - cons)) .
Check that the error decreases with each iteration (see \cref{svalid} for
more on model checking and validation).
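As a sketch of how this check might be wrapped up for repeated use (the function
name here is ours, for illustration --- it is not part of the course scripts):
\begin{lstlisting}[float=h, caption={A small helper function for total absolute error (TAE); illustrative only}]
tae <- function(sim, obs) sum(abs(sim - obs)) # total absolute error
tae(ind.agg, cons) # should shrink after each constraint and iteration
\end{lstlisting}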
% At this stage you should switch to running the code in `simple.R'. In RStudio
% to run a line of code from the top left window, simply press \verb Ctl-Enter .
Continue to enter the code contained in `simple.R' line by line to constrain
by the second constraint, sex.
Notice that we have done in code what was previously done by
hand and in a spreadsheet.
Depending on which method you feel most comfortable with, it is worth
spending time repeating the previous steps to ensure they make sense.
The other advantage of running the same process 3 times in 3 different ways
is that we can cross-compare the results from each.
Compare
the following listing with \cref{finalout}, for example: are the weights the same?
\begin{lstlisting}[float=h, caption={W(3) after constraining by age and sex in R.
Compare with \cref{finalout}}]
weights3[,1] # check the weights allocated for zone 1
## [1] 1.2 1.2 3.6 1.5 4.5
\end{lstlisting}
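They should be: the fractions in \cref{finalout} are exactly these decimals.
A quick programmatic check (a one-line sketch):
\begin{lstlisting}[float=h, caption={Confirming the R result matches the hand-calculated weights in \cref{finalout}}]
all.equal(weights3[, 1], c(6/5, 6/5, 18/5, 3/2, 9/2)) # should return TRUE
\end{lstlisting}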
% \newpage
\subsection{Iterations}
In the above example, we have seen the `bare bones' of spatial microsimulation
using IPF to generate weights from which sample populations can be created.
To perform multiple iterations of the same model, we have prepared a slightly
more complex script than `simple.R' called `simple-iterations.R' that can
also be found in the project's root directory. We set the number
of iterations manually (just
change the value of \verb num.its ). This works by calling another
script called `e2.R', which runs the re-weighting process
repeatedly, once for each iteration. Notice that after only 2 full
iterations, the perfect solution is reached ($r = 1$).
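The core of `simple-iterations.R' boils down to a loop of roughly this shape
(a sketch of the idea, not the script verbatim):
\begin{lstlisting}[float=h, caption={The essence of running multiple IPF iterations (sketch; see `simple-iterations.R' for the real code)}]
num.its <- 3 # number of complete IPF iterations to perform
for (it in 1:num.its) {
  source("e2.R") # reweight by every constraint, then re-aggregate
}
\end{lstlisting}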
There is great scope for taking the analysis further:
some further tests and plots are presented on the on-line
versions of this section. The simplest case is contained in
Rpubs document \href{http://rpubs.com/RobinLovelace/6193}{rpubs.com/RobinLovelace/6193}
and a more complex case (with three constraints) can be found in Rpubs document
\href{http://rpubs.com/RobinLovelace/5089}{5089}. For now, however,
we progress to a more complex example, CakeMap.
\section{CakeMap: A worked example}
In this section we will process real data to arrive at an important result:
an estimate of the amount of cake eaten in different wards in West Yorkshire. The
example is deliberately rather absurd: hopefully this will make
the technique more memorable. An attempt has been made to present the method
in as generalisable a way as possible, allowing users to apply the technique described
here to their own data.
The code needed to run the main part of the example is contained within `cMap.R'.
Note that this script makes frequent reference to files contained in the `cakeMap'
sub-folder of `data'. \verb read.csv("data/cakeMap/ind.csv") , for example, is used to read
the individual level data into R.
Because this example uses real world data (albeit in anonymised form),
we will see how important the process of `data cleaning' is.
Rarely are datasets downloaded in exactly the right form for them to
be pumped directly into a spatial microsimulation model. For that reason
we describe the process of loading and preparing the input data first:
similar steps will be needed to use your own data as an input into a
spatial microsimulation model.
\subsection{Loading and preparing the input data}
The raw constraint variables for CakeMap were downloaded from
the Infuse website (\href{http://infuse.mimas.ac.uk/}{infuse.mimas.ac.uk/}).
These, logically enough, are stored in the `data/cakeMap/' directory
as .csv files and contain the word `raw' so it is clear which files
contain the raw data. The file `age-sex-raw.csv', for example, is the raw
age and sex data that was downloaded. As the screenshot in \cref{fraw} shows,
these datasets need to be
processed before they can be used as an input
to our model. `con1.csv' stands for
`constraint 1' and represents this age/sex data after it has been processed.
\begin{figure}
\includegraphics[width=12cm]{raw-data-screenshot}
\centering
\caption{The raw input data for the CakeMap model, downloaded from the InFuse website.}
\label{fraw}
\end{figure}
To ensure reproducibility in the process of converting the raw data into
this `spatial microsimulation ready' dataset, all of the steps used to
clean and rearrange the data have been saved. Take a look at the R script file
`process-age.R' --- these are the kinds of steps that the researcher will need to
undertake before performing a spatial microsimulation model.
Here is not the place to delve into the details of data reformatting;
there are resources dedicated to that \citep{tidy-data, kabacoff2011r}.
However, it is worth taking a look at the `process-age.R' script and the
other `process*' files to see how R can be used to quickly
process complex raw data into a form that is
ready for use in spatial microsimulation.
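As a taster of what such scripts involve, the block below is a generic sketch of
typical cleaning steps (it is not the contents of `process-age.R' itself, and
the column indices are hypothetical):
\begin{lstlisting}[float=h, caption={Generic data-cleaning steps of the kind found in the `process*' scripts (illustrative sketch)}]
raw <- read.csv("data/cakeMap/age-sex-raw.csv") # the raw download
con1 <- raw[, -(1:2)] # drop label columns (indices are hypothetical)
names(con1) <- gsub("X", "", names(con1)) # tidy auto-generated names
write.csv(con1, "data/cakeMap/con1.csv", row.names = FALSE)
\end{lstlisting}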
The input data generated through this process of data preparation are named
`con1.csv' to `con3.csv'. For simplicity, all these were merged into a single
dataset called `cons.csv'. All the input data for this section
are loaded with the following commands:
\begin{lstlisting}[float=h, caption={Loading the input data for CakeMap (see `cMap.R')}]
ind <- read.csv("data/cakeMap/ind.csv")
cons <- read.csv("data/cakeMap/cons.csv")
\end{lstlisting}
Take a look at these input data using the techniques learned in the previous
section. To test your understanding, try to answer the following questions:
what are the constraint variables? How many individuals are in the survey microdataset?
How many zones will we generate spatial microdata for?
For bonus points that will test your R skills as well as your practical knowledge
of spatial microsimulation, try constructing queries in R that will automatically
answer these questions.
It is vital to understand the input datasets before trying to model them, so
take some time exploring the input. Only when satisfied with your understanding of
the datasets (a pen and paper can help here, as well as R!) is it time
to move on to generate the spatial microdata using IPF.
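If you get stuck on the questions above, one-line queries along the following
lines (a sketch, assuming the objects loaded in the previous listing) will
answer the first of them:
\begin{lstlisting}[float=h, caption={Example queries for exploring the CakeMap input data}]
nrow(ind)   # number of individuals in the survey microdataset
nrow(cons)  # number of zones to generate spatial microdata for
names(cons) # category names, hinting at the constraint variables
\end{lstlisting}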
\subsection{Performing IPF on the CakeMap data}
\label{sipfcake}
The R script used to perform IPF on the CakeMap data is almost identical to
that used to perform the process on the simple example in the previous section.
The main difference is that there are more constraints (3 not 2) but otherwise
the code is very similar. Confirm this for yourself: it will be instructive
to compare the `simple.R' and
`\href{https://github.com/Robinlovelace/smsim-course/blob/master/cMap.R}{cMap.R}'
script files. Note that
the former works on a tiny example of 5 input individuals whereas the latter
calculates weights for 916 individuals. The code is generalisable, meaning
that the IPF process would still work given slightly different inputs.
The script would still work if more individuals or zones were added, although
the code would need to be adjusted to accept different constraints.
To ensure the process works with different
constraints, the file `categorise.R' must also be modified.
Note that `cMap.R' contains 100 lines --- longer than the `simple.R' file.
The additional length is due
to the third constraint, additional iterations (see line 66 onwards) and some
preliminary analysis. The best way to understand what is happening in this
script is simply to read through it and run chunks of it in R (remember the
\verb Ctrl-Enter ~ shortcut) to see what happens. Try to `think like a computer'
and imagine what each line of code is doing. This is inevitably difficult
at first so it is recommended that users add more comments to lines
that cause confusion to help remember what is going on. To help this process,
the meaning of some of the trickier and more important lines is explained below.
One potentially confusing area is the representation of the \texttt{weights} object,
used to store the weights after each constraint. To understand what is happening,
try entering this code into R and seeing what happens:
\begin{lstlisting}[float=h, caption={Creating the initial weight matrix --- see lines
28 to 30 in cMap.R}]
# create weights in 3D matrix (individuals, areas, iteration)
weights <- array(dim=c(nrow(ind),nrow(cons),num.cons+1))
weights[,,num.cons+1][] <- 1 # sets initial weights to 1
ini.ws <- weights[,,num.cons+1]
\end{lstlisting}
In the above code, \texttt{weights} is created by the \texttt{array}
command, which in fact creates many 2 dimensional matrices in a
`data cube'. This is different from the weight object used in the
`simple.R' script, which had only 2 dimensions.
If this is confusing, just remember that
the `square' weight matrix --- where each column represents a zone
and each row represents an individual --- can be returned to by
specifying the third dimension. Dimensions in R are specified by
the square brackets, so the initial weights can be called by typing
\texttt{weights[,,4]}. This should result in hundreds of 1s, the initial weights.
To print a more manageable-sized output, remember how to subset the other
dimensions: \texttt{weights[1:3,1:5,4]}, for example, will print the first
3 rows (individuals) and first 5 columns of the 4th matrix. Why do it this way?
Because a single weights object stores all the previous sets of weights,
letting us go back and check whether the model fit improves from one constraint
and one iteration to the next.
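The following commands, run after the listing above, make the array's structure
concrete (a sketch; the exact outputs depend on your data):
\begin{lstlisting}[float=h, caption={Exploring the three-dimensional weights array}]
dim(weights)          # individuals x zones x (num.cons + 1)
weights[1:3, 1:5, 4]  # 3 individuals, 5 zones, the initial (all 1) weights
\end{lstlisting}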
Now, when the new weights are set after iteration 1, they are saved as follows,
within a nested \emph{for loop} that iterates over all zones (j) and
constraint categories (i):
\begin{lstlisting}[float=h, caption={Saving the new weights after the first
constraint of IPF --- see line 43 in `cMap.R'}, label=cmapipf]
weights[which(ind.cat[,i] == 1),j,1] <- con1[j,i] / ind.agg[j,i,1]}}