\documentclass[a4paper, 11pt, twoside]{article}
\usepackage[hcentering,bindingoffset=10mm,margin=20mm]{geometry}
\usepackage{graphicx}
\graphicspath{{figures/}} % Location of the graphics files
\usepackage{makeidx}
% \graphicspath{/figures} % Location of the graphics files
\usepackage{multirow}
% Making R code work!
\usepackage{listings}
\usepackage{color}
\usepackage{hyperref}
\usepackage{booktabs}
\usepackage{float} % allows forcing of float position
% \restylefloat{table}
\usepackage{bm}
\usepackage[parfill]{parskip}
% begins paragraphs with an empty line rather than an indent
% Add toc to contents
\usepackage[nottoc,numbib]{tocbibind}
\hypersetup{urlcolor=blue, colorlinks=false, hypertexnames=true} % Colours hyperlinks in blue, but this can be distracting
\usepackage[capitalise]{cleveref}
\definecolor{dkgreen}{rgb}{0,0.6,0}
\definecolor{gray}{rgb}{0.5,0.5,0.5}
\definecolor{mauve}{rgb}{0.58,0,0.82}
\crefname{lstlisting}{listing}{listings}
\Crefname{lstlisting}{Listing}{Listings}
\lstset{ %
language=R, % the language of the code
basicstyle=\normalsize\ttfamily, % the size of the fonts that are used for the code
% numbers=left, % where to put the line-numbers
% numberstyle=\tiny\color{gray}, % the style that is used for the line-numbers
% stepnumber=2, % the step between two line-numbers. If it's 1, each line
% will be numbered
% numbersep=5pt, % how far the line-numbers are from the code
% backgroundcolor=\color{white}, % choose the background color. You must add \usepackage{color}
% showspaces=false, % show spaces adding particular underscores
% showstringspaces=false, % underline spaces within strings
% showtabs=false, % show tabs within strings adding particular underscores
frame=false, % adds a frame around the code
rulecolor=\color{white}, % if not set, the frame-color may be changed on line-breaks within not-black text (e.g. comments (green here))
% tabsize=2, % sets default tabsize to 2 spaces
% captionpos=b, % sets the caption-position to bottom
% breaklines=true, % sets automatic line breaking
% breakatwhitespace=false, % sets if automatic breaks should only happen at whitespace
% title=\lstname, % show the filename of files included with \lstinputlisting;
% also try caption instead of title
keywordstyle=\color{blue}, % keyword style
commentstyle=\color{dkgreen}, % comment style
stringstyle=\color{mauve}, % string literal style
escapeinside={\%*}{*)}, % if you want to add a comment within your code
morekeywords={*,...} % if you want to add more keywords to the set
}
% Include any extra LaTeX packages required
\usepackage[round]{natbib} % Use the "Natbib" style for the references
\usepackage{verbatim} % Needed for the "comment" environment to make LaTeX comments
\usepackage{wallpaper}
\usepackage{cases}
\usepackage{wrapfig}
\makeindex
% \renewcommand{\includegraphics}[2][]{\fbox{#2}} %omits images
\begin{document}
\title{Introducing spatial microsimulation with R: a practical}
% \author{Robin Lovelace --- R.Lovelace at. Leeds. ac. uk}
\pagestyle{myheadings}
\author{Lovelace, Robin\\
\texttt{[email protected]}}
\maketitle
\tableofcontents
\newpage
\section{Foundations}
This practical teaches the basic theory and practice of `spatial microsimulation'
using the popular free software package R. The term microsimulation means
different things in different disciplines, so it is important to be clear
at the outset what we will and will not be covering.
\emph{We will} be learning how to create \emph{spatial microdata}, the
basis of all spatial microsimulation models, using \emph{iterative
proportional fitting} (IPF). IPF is
an efficient method for allocating individuals from a
non-spatial dataset to geographical zones, analogous to the
`Furness method' in transport modelling, but with more constraints.
There are other ways of generating
spatial microdata but, as far as the author is
aware,\footnote{There is much need
to compare different methods of generating spatial microdata, for example by
building on the work of \citet{harland2012}. A project for doing
these tests in a transparent way has been started at
\href{https://github.com/Robinlovelace/IPF-performance-testing}{github.com/Robinlovelace/IPF-performance-testing}
and there are probably many opportunities for improving the computational efficiency
of the code presented in this course --- please get in touch if you would
like to contribute.}
this is the most effective and flexible for many applications.
An alternative approach using the open source `Flexible Modelling Framework'
program is described in detail, with worked examples, by \citet{harland2013microsimulation}.
\emph{We will not} be learning `dynamic spatial microsimulation'
\citep{Ballas2005b}: once the
spatial microdata have been generated and \emph{integerised}, it is up to the
user how they are used --- be it in an agent-based model or as a basis for
estimates of income distributions at the local level.
We thus define spatial microsimulation narrowly in this tutorial as the process
of generating \emph{spatial microdata} (more on this below). The term
can also be used to describe a wider approach that harnesses individual-level data
allocated to zones for investigating phenomena that vary over space and between
individuals, such as income inequality or energy overconsumption. In both cases, the generation
of spatial microdata is the critical element of the modelling process so the skills
learned in this tutorial will provide a firm foundation for further work.
One of the tricky things about spatial microsimulation for newcomers is its use
of specialist language. It is important to know exactly what is meant by
`spatial microdata' and other technical terms. To this end we have created a
glossary that provides succinct definitions (see \cref{gloss}). Any term that is
\emph{italicised} in the text has a glossary entry.
\begin{wrapfigure}{R}{11cm}
\begin{center}
\includegraphics[width=10.5cm]{rstudio}
\end{center}
\caption{The RStudio user interface. Note the project title `smsim-course' in
the top right and the `Files' tab at the top of the left hand window.}
\label{frstudio}
\end{wrapfigure}
\subsection{Prerequisites for this practical}
This practical uses the statistical analysis and modelling software
R. We suggest you install and take a look at this powerful program
before getting started. We recommend using R within RStudio (\cref{frstudio}), which makes
using R much easier. Instructions to install both R and RStudio can be found
at \href{http://www.rstudio.com/ide/download/desktop}{rstudio.com}.
% Could add screenshot of Dl ZIP here!!!
The other prerequisite for the course is downloading the example data.
These can be downloaded in a single zip file which can be found on the
course's GitHub repository:\footnote{GitHub
is used to serve this .zip file because if anything changes
in the tutorial, the .zip file from the same location will
automatically be updated. The other advantages of GitHub are its
visibility --- a pdf of this tutorial, for example,
is kept up-to-date at
\href{https://github.com/Robinlovelace/smsim-course/blob/master/handout.pdf}{github.com/Robinlovelace/smsim-course}
in the `handout.pdf' file --- accessibility and encouragement of evolution: to contribute
to this project all one needs to do is
`\href{https://help.github.com/articles/fork-a-repo}{fork}' it.
} \\
\href{https://github.com/Robinlovelace/smsim-course}{github.com/Robinlovelace/smsim-course}.
Click on the ``Download ZIP'' button to the right of this page and extract
the folder into your desktop or other suitable place on your computer.
Once the folder has been successfully extracted, open it in
your file browser and take a look around. You will find a number of files
and two sub-folders
entitled `data' and `figures'. When you get to the
sections that use R code, it is useful for R to operate from
within the smsim-course-master folder. Probably the best way
to set this up is to open the file `smsim-course.Rproj' from
within RStudio (\cref{frstudio}).
Try this now and click on the `Files' tab in
the bottom right hand window of RStudio.
Before using the power of R in RStudio it's worth
understanding a bit about `IPF', the algorithm we will use to generate
the synthetic population or `spatial microdata' (\cref{fmsim-schema}).
\begin{figure}[H]
\begin{center}
\includegraphics[width=7cm]{msim-schema}
\end{center}
\caption{Schema of iterative proportional fitting (IPF) and combinatorial
optimisation
in the wider context of the availability of different data formats and spatial
microsimulation. \label{fmsim-schema}}
\end{figure}
\subsection{Learning by doing}
As \citet[xxii]{kabacoff2011r} put it regarding R, ``the best
way to learn is to experiment'' and the same applies to spatial microsimulation.
We believe you will learn the technique/art best not by reading about it,
but by doing it. Mistakes are inevitable in any challenging task and should not
be discouraged. In fact it is by making blunders, identifying and then correcting them
that many people learn best. Think of someone learning to skate: no one ever
picks up a skateboard for the first time being able to `surf the sidewalks'. It
takes time, patience and plenty of falls before you master the art.
The same applies to spatial microsimulation.
Spatial microsimulation works by taking \emph{microdata} at the individual level
and using aggregate-level constraints to allocate these individuals to zones.
The two main methods are \emph{deterministic reweighting} and \emph{combinatorial optimisation}.
This practical takes the former approach using a process called \emph{iterative proportional fitting}
(IPF). IPF is used to increase the weights of individuals who are representative
of the target area and reduce the weights of individuals who are relatively
rare \citep{Lovelace2013-trs}.
The output is a \emph{spatial microdataset}.
A schematic of the process is shown in \cref{fmsim-schema}.
Take a look at the image and think about the process. But we don't want to get
bogged down in theory or applications in this course: we want to `get our
hands dirty'. Next we will do just that, by applying the process to some
example data.
\subsection{Input data} \label{s:theory}
Let's start with two very basic datasets. To aid understanding, we will primarily do
the reweighting by hand to understand the process before automating it on the computer.
As outlined in Figure~\ref{fmsim-schema},
there are two input files required to perform spatial microsimulation:
\begin{itemize}
\item Survey data --- information about individuals
\item Geographically aggregated (`wide') data --- typically aggregated Census tables
\end{itemize}
Table \ref{t:w} shows 5 individuals, who are defined by two constraint variables:
age and sex. This is analogous to the survey data shown in Figure~\ref{fmsim-schema}.
Table \ref{t:m} presents the same data in aggregated form; its marginal totals
are used in the IPF procedure described shortly. Note that individual 3 is highlighted in
blue and bold for future reference.
\begin{table}[h]
% {r}{10cm}
\caption{Hypothetical input microdata (`survey data'),
with the original
weights set to one. The bold and blue value is used subsequently for
illustrative purposes.
}
\label{t:w}
\begin{center}
\begin{tabular}{llll}
\toprule
{Individual } & {Age} & {Sex} & {Weight} \\
\midrule
1 & 59 & Male & 1 \\
2 & 54 & Male & 1 \\
3 & {35} & {Male} & \textbf{\color{blue}1} \\
4 & 73 & Female & 1 \\
5 & 49 & Female & 1 \\
% 1 & 59 & m \\
% 2 & 54 & m \\
% 3 & 35 & m \\
% 4 & 73 & f \\
% 5 & 49 & f \\
\bottomrule
\end{tabular}
\end{center}
\end{table}
\begin{table}[htbp]
\centering
\caption[The aggregated results of the weighted
microdata set]{The aggregated results of the weighted
microdata set ($m$ --- think ``m'' for Modified or siMulated).
Note, these values are updated by each new set of weights.
% so change after each iteration.
% hence the number ``(1)''
% --- this table is for the initial weight.
}
\begin{tabular}{cllll}\toprule
Marginal totals& & \multicolumn{2}{c}{$j$} & \\
& Age/sex & Male & Female & T \\ \midrule
\multirow{2}{*}{$i$} & Under-50 & \textbf{\color{blue}1} & 1 & \textbf{\color{blue}2}\\
& Over-50 & 2 & 1 &3 \\
& T & 3 & 2 &5\\
\bottomrule
\end{tabular}
\label{t:m}
\end{table}
Table \ref{t:s} contains data for a single area. This is analogous to the
geographically aggregated `wide' data shown in Figure~\ref{fmsim-schema}.
\begin{table}[H]
\centering{
\caption{Small area constraints (geographically aggregated or `wide' input data).
Note that although individual 3 fits into both ``under 50'' and ``male'' categories,
only the value of the former is highlighted as age is the first constraint. \label{t:s}}
\begin{tabular}{cllll}
\toprule
Constraint $\Rightarrow$ & \multicolumn{2}{c}{$i$}& \multicolumn{2}{c}{$j$}\\
Category $\Rightarrow$ & $i_1$ & $i_2$ & $j_1$ & $j_2$ \\
Area $\Downarrow$ & Under-50 & Over-50 & Male & Female\\
1 & \textbf{\color{blue}8} & 4 & 6 & 6\\
\bottomrule
\end{tabular}}
\end{table}
The astute reader will recognise that the data in
Tables \ref{t:m} and \ref{t:s} are incompatible: they are
two matrices with different dimensions ($3 \times 3$ and $1 \times 4$ respectively).
To solve this problem we can rearrange the small area census constraints data
($s$, presented in Table \ref{t:s}) to match the dimensions of
$m(1)$ (table \ref{t:m}).
\Cref{t:s2} illustrates $s$ in a different form.
Note that although we are ignorant of the internal values of this matrix,
it is of the same dimension as $m(1)$ and the marginal totals (denoted ``T'' for
total) can be directly compared. This is the basis of the method and provides a
way to systematically compare individual level data with geographic aggregates.
\begin{table}[h]
\centering
\caption[Small area constraints expressed as marginal totals]{Small
area constraints (matrix $s$) re-arranged to be the same dimension as $m$ (Table \ref{t:m}).
``s'' here stands for Small area data or conStraint. Note only
the marginal totals are known. The internal cell
values (represented by ``?'') must be estimated by spatial microsimulation.
}
\begin{tabular}{cllll}\toprule
Marginal totals& & \multicolumn{2}{c}{$j$} & \\
& Age/sex & Male & Female & T\\ \midrule
\multirow{2}{*}{$i$} & Under-50 & \textbf{?} & ? & \textbf{\color{blue}8}\\
& Over-50 & ? & ? &4 \\
& T & 6 & 6 &12\\
\bottomrule
\end{tabular}
\label{t:s2}
\end{table}
With our input data in place, we can begin our first iteration.
\subsection{The IPF equation} \label{ipfeq}
% \begin{wrapfigure}{R}{9cm}
% \includegraphics[width=8cm]{equation1}
% \caption{\Cref{eq:ipf} in word form.}
% \label{feq}
% \end{wrapfigure}
Using the tables $m$ and $s$ presented above we readjust the weights of the
individuals so that:
\begin{itemize}
% \item If there are exactly the same number of
\item Individuals who are rare in the geographic zone of interest are given a smaller weight (we need fewer of them).
\item Individuals who are common in the zone of interest are given a larger weight (we need more of them).
\item The weights sum to the total population of the geography of interest (12 in our example).
\end{itemize}
For each constraint, the current weight (Table~\ref{t:w}) is multiplied by the respective
total from the geographically aggregated (census) table (Table~\ref{t:s}) and
divided by the corresponding marginal total of the survey data (see \cref{t:m}).
For example, using the age constraint the new weight of individual 3 is calculated
based on the following numbers: the initial weight is set as 1, the number of
individuals under 50 in the geographically aggregated (Census) data is 8 (see Table \ref{t:s})
and the number
of individuals under 50 in the survey data is 2 (see Table~\ref{t:m}). Thus the new weight is
$1 \times \frac{8}{2}$. These figures
are highlighted in \textbf{\color{blue}bold and blue} in the preceding tables.
The same process is used to calculate the new weights for all individuals --- remember
we are only constraining by age for now.
% Test your understanding by calculating
% the value of the ? symbols in table~\ref{t:new-weights-part-complete} below.
%
% \begin{table}[htbp]
% \centering
% \caption{Partially complete reweighted microdata.}
% \begin{tabular}{lllll}
% \toprule
% {Individual} & {Sex} & {age-group} & {Weight} &
% {New weight, $w(2)$} \\ \midrule
% 1 & Male & Over-50 & 1 & ? \\
% 2 & Male & Over-50 & 1 & ? \\
% 3 & Male & Under-50 & 1 & $1 \times \frac{\color{blue}8}{\color{blue}2} = 4$ \\
% 4 & Female & Over-50 & 1 & ? \\
% 5 & Female & Under-50 & 1 & ? \\
% \bottomrule
% \end{tabular}
% \label{t:new-weights-part-complete}
% \end{table}
%
This process of \emph{reweighting} is done one constraint at a time and can
be described more succinctly in the form of a word equation
\cref{eq:ipf-word}, and more generally by \cref{eq:ipf} for constraint $i$ (age):
\begin{equation}
Weight_{New} = Weight_{Current} \times \frac{Census_{Constraint \: Total}}{Survey_{Constraint \: Total}}
\label{eq:ipf-word}
\end{equation}
\begin{equation}
w(n+1)_{ij} = \frac{w(n)_{ij} \times sT_{i}}{mT(n)_{i}}
\label{eq:ipf}
\end{equation}
where $w(n+1)_{ij}$ is the new weight for individuals with characteristics $i$
(age, in this case) and $j$ (sex), $w(n)_{ij}$ is the original
weight for individuals with these characteristics, $sT_{i}$ is the
marginal total of category $i$ in the small area constraint table, $s$
(Table \ref{t:s}), and $mT(n)_{i}$ is the marginal total of category
$i$ of the aggregated results of the weighted
microdata, $m$ (Table \ref{t:m}).
$n$ represents the constraint number.
Do not worry about completely understanding the above procedure for now; its
meaning will probably take time to sink in.
More important is implementing it.
Follow the blue emboldened values in Tables \ref{t:w} to \ref{t:s2} to fill in the
question marks in \cref{t:new-weights} and calculate the new weight
of individual 3, a male under 50 years of age.
To validate your calculation, the sum of weights ($W_1$ to $W_5$)
should equal the population of the zone (12, in our example). This is a
useful check.
\begin{table}[htbp]
\centering
\caption{Reweighting the hypothetical microdataset in order to fit
Table \ref{t:s}.}
\begin{tabular}{lllll}
\toprule
{Individual} & {Sex} & {age-group} & {Weight} &
{New weight, w(2)} \\ \midrule
1 & Male & Over-50 & 1 & $1 \times 4/3 = \frac{4}{3}$ \\
2 & Male & Over-50 & 1 & $1 \times 4/3 = \frac{4}{3}$ \\
3 & Male & Under-50 & 1 & $1 \times \frac{\textbf{\color{blue}8}}{\textbf{\color{blue}?}} = \textbf{\color{blue}?}$ \\
4 & Female & Over-50 & 1 & $1 \times 4/3 = \frac{4}{3}$ \\
5 & Female & Under-50 & 1 & $1 \times \frac{8}{2} = 4$ \\
\midrule
Total & & & 5 & 12 \\
\bottomrule
\end{tabular}
\label{t:new-weights}
\end{table}
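To verify your answer for individual 3 (and the completed table as a whole),
the new weights should sum to the zone's population of 12:
\begin{equation*}
\frac{4}{3} + \frac{4}{3} + \textbf{\color{blue}4} + \frac{4}{3} + 4 \; = \; \frac{12}{3} + 8 \; = \; 12
\end{equation*}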
\subsection{Re-aggregation after the first constraint} \label{sreag}
After creating $Weight_{New}$ for all categories of the first constraint (under-50 and 50$+$ for age),
the individual level data are re-aggregated before new weights can be calculated from the next constraint.
Re-aggregating the individual-level data --- to compare the marginal totals with the constraint tables
ready for the second constraint --- will result in
\cref{t:m2}. To generate this table, the new weights ($w(2)$,
presented in \cref{t:new-weights}) are multiplied by the
number of individuals in each category. Thus, to calculate the new
estimate for the number of males we multiply the weight of
males under 50 (4 --- follow the emboldened values above) by the
number of individuals in this category (1) and add the corresponding values
for males over 50 ($\frac{4}{3}$ and 2). This is as follows
(see the bold value in the bottom left of \cref{t:m2}):
\begin{equation}
\sum_{i = male} m(2) \; = \; \textbf{\color{blue} 4} \times \textbf{\color{blue}1} + \frac{4}{3} \times 2 \; = \;
\textbf{\color{blue}4} + \frac{8}{3} \; = \textbf{\color{blue}6} \frac{\textbf{\color{blue}2}}{\textbf{\color{blue}3}}
\end{equation}
After performing this process of re-aggregation (\cref{t:m2}) for all categories
in the \emph{age} constraint,
the next stage is to repeat \cref{eq:ipf} for the \emph{sex} constraint to generate a
third set of weights, by replacing
the $i$ in $sT_{i}$ and $mT(n)_{i}$ with $j$ and incrementing the value of $n$:
\begin{equation}
w(3)_{ij} = \frac{w(2)_{ij} \times sT_{j}}{mT(2)_{j}}
\label{eq:ipf2}
\end{equation}
\subsection{Test your understanding: the second constraint}
To test your understanding of IPF, try to apply \cref{eq:ipf2} to the
information above
and that presented in \cref{t:m2}.
This should result in the following vector of new weights, for individuals 1 to
5. Calculate the correct values and pencil them in in place of the question
marks. % Add online link to answers - tinyurl
\begin{equation}
w(3) = (\frac{6 \times \frac{4}{3}}{6 \frac{2}{3}},\;\;\; \frac{6 \times
\frac{\;4\;}{\;3\;}}{\;?\;},\;\;\; \boldsymbol{ \frac{\color{blue}6 \times 4}{\color{blue} 6\frac{2}{3}
}}, \;\;\;\frac{\;?\; \times \;?\;}{?}, \;\;\;\frac{4 \times 6}{5 \frac{1}{3}})
% w(3) = (\frac{6}{5}, \frac{6}{5}, \frac{18}{5}, \frac{3}{2}, \frac{9}{2})
\end{equation}
After simplifying these fractions, the results are as follows.
One `sanity' check of your method here is whether the sum of these
weights is still equal to the area's total population of twelve. Test that this is
true:
\begin{equation}
% w(3) = (\frac{6}{5}, \frac{6}{?}, \boldsymbol{ \frac{4 \times 6}{ 6\frac{2}{3}
% }}, \frac{?}{?}, \frac{9}{2})
w(3) = (\frac{6}{5}, \frac{6}{5}, \boldsymbol{\frac{\color{blue}18}{\color{blue}5}}, \frac{3}{2},
\frac{9}{2})
\label{finalout}
\end{equation}
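Carrying out the check:
\begin{equation*}
\frac{6}{5} + \frac{6}{5} + \frac{18}{5} + \frac{3}{2} + \frac{9}{2} \; = \; \frac{30}{5} + \frac{12}{2} \; = \; 6 + 6 \; = \; 12
\end{equation*}
so the zone's total population is indeed preserved.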
What do the weights in w(3) actually mean? They indicate how representative
each individual is of the target zone after one iteration of IPF, constraining
by age and sex. Individual number 5 has the highest weight because there is
only one young female in the survey dataset yet seemingly many in the area in
question.
Notice also that after each iteration the fit between the marginal
totals of $m$ and $s$
improves.\footnote{This can be checked by comparing the aggregated weighted
individuals with the small area constraints. Total absolute error (TAE),
defined as the sum of all absolute differences between simulated and observed marginal
totals, falls from 14 for $m(1)$ to
1.3 for $m(2)$ (compare \cref{t:m} and \cref{t:m2} with \cref{t:s2} above).
TAE for $m(3)$ (not shown,
but calculated by aggregating $w(3)$) improves even more, to 0.2.
This number converges towards 0 through subsequent
iterations, a defining feature of IPF.}
\begin{table}[htbp]
\centering
\caption[Aggregated results after constraining for age]{The
aggregated results of the weighted
microdata set after constraining for age ($m(2)$).
}
\begin{tabular}{cllll}\toprule
Marginal totals& & \multicolumn{2}{c}{$j$} & \\
& Age/sex & Male & Female & T\\ \midrule
\multirow{2}{*}{$i$} & Under-50 & \textbf{\color{blue}4} & 4 & 8\\
& Over-50 & $\frac{8}{3}$ & $\frac{4}{3}$ & 4 \\
& T & $\boldsymbol{{\color{blue}6}\frac{\color{blue}2}{\color{blue}3}}$ & 5$\frac{1}{3}$ & 12\\
\bottomrule
\end{tabular}
\label{t:m2}
\end{table}
% A key benefit from a policy perspective is that
% IPF and other spatial microsimultion techniques
% can provide estimation of variables whose values are not
% known at the local level (e.g. income).
\subsection{IPF in a spreadsheet}
% Computer code is absent from most people's daily lives so the
% R code we use to automate IPF will likely seem like a foreign language
% to many. Spreadsheet programs like Microsoft Excel and LibrOffice Calc are
% comparatively well known, though.
R is an unfamiliar programme to many, so before we move on to producing an R script that can calculate weights, we will carry out an intermediary step in a spreadsheet programme.\footnote{This
example was produced by Phil Mike Jones --- see \href{http://www.philmikejones.net/}{philmikejones.net}.}
For small examples like the one we are using, a spreadsheet has two main advantages: it is already familiar to many people, so it can act as a bridge to calculating IPF on a computer; and it is easier to see and recall more than one table in a spreadsheet than in R with RStudio.
Open `sms-spreadsheet-exercise.xlsx' if you are using Microsoft Excel or `sms-spreadsheet-exercise.ods' (note the different file types) if you are using LibreOffice/OpenOffice Calc and follow the tasks outlined in the file. The steps are exactly the same as in the pen and paper example above. This allows you to check your answers more easily, helps with the transition to performing the calculations of IPF in a computer programme, and aids recall of the procedure through repetition.
\section{IPF in R} \label{simplementing}
% priority: automate!!!
So far we have implemented IPF by hand and in a spreadsheet.
This section explains how the IPF
\emph{algorithm} described above is implemented in R, using the
same input data. Now we will be working within the RStudio
environment with the `smsim-course' project
loaded and the `simple.R' file open in the top left panel
(\cref{frsimple}).
\begin{figure}
\begin{center}
\includegraphics[width=15cm]{rs-simple}
\end{center}
\caption{The RStudio user interface with the `simple.R' file loaded in
the top left window and some output displayed in the bottom left window.}
\label{frsimple}
\end{figure}
Note a couple of things from this figure:
there are many explanations of the code and these are commented out
by the \texttt{\#} (`hash') symbol. Also note that in the bottom left panel
R displays the output of the commands we send it from the script.
Hitting the \texttt{Ctrl} and \texttt{Enter} keys simultaneously from
within the upper (script) window will cause R to run these commands,
line by line. In the code blocks below, any outputs from R are preceded
by `\texttt{\#\#}'. Comparing these outputs with the output on your screen will
ensure that R is behaving as it should.
We will spend the rest of this section working through
the commands contained in this file.\footnote{A longer
tutorial is available from Rpubs, a site dedicated
to publishing R analyses that are reproducible:
\href{http://rpubs.com/RobinLovelace/5089}{rpubs.com/RobinLovelace/5089}.}
% This section is based on ``Spatial microsimulation in R: a
% beginner’s guide to iterative proportional fitting (IPF)'', a tutorial
% written to accompany a methods paper on integerisation
The code snippets are taken from the `simple.R' script file (just a plain text
document, like all R scripts) so you can copy and paste. However, it is
highly recommended not to look at this file until you need to, towards
the end of this
section. This is learning by typing, as advocated by \citet{shaw2013learn}.
\subsection{Loading and exploring the input data}
The data presented in the above worked example are saved in the
`simple' sub-folder of `data' as .csv files. To load them in R,
we use the following commands:
\begin{lstlisting}[float=h, caption={Loading the input data in R}, label=cusd]
ind <- read.csv("data/simple/ind.csv") # load the individual level data
cons <- read.csv("data/simple/cons.csv") # load aggregate constraints
\end{lstlisting}
To check that the \emph{data frames} have loaded correctly, simply
ask R to print their contents by entering the object's name. To display the
individual level data, for example, simply type \verb/ind/:\footnote{An alternative
solution is to click the object's name in `Environment' in
RStudio's top right window.}
\begin{lstlisting}[float=h, caption={Checking the contents of the individual
level data frame}, label=cout]
ind # Show the data frame in R
##   id age sex
## 1  1  59   m
## 2  2  54   m
## 3  3  35   m
## 4  4  73   f
## 5  5  49   f
\end{lstlisting}
Note that the input data is identical to the tables illustrated above for
consistency. Instead of constraining for just one zone as we did in the example
above, we will fit the data to six zones here. To subset a dataset, add square
brackets to the end of the object's name in R. Entering \verb cons[1:2, ] ,
for example, should output the first two rows of the constraint variable.
Within the square brackets the number before the comma refers to rows (blank
means all rows) and the numbers after the comma refer to columns. If the output
of this command looks like the text below, congratulations are in order:
you have successfully loaded a subset of the constraint dataset.
\begin{lstlisting}[float=h, caption={Printing a subset of the constraint data
in R},
label=cou]
cons[1:2, ]
## X16.49 X50. m f
## 1 8 4 6 6
## 2 2 8 4 6
\end{lstlisting}
Note that the top row is identical to \cref{t:s} --- we can therefore compare
the results of doing IPF on the computer with the results obtained by hand.
% !!! Manual input of data for book!!!
% Usually the input data for spatial microsimulation are loaded from
% .csv files, one for each
% constraint and one for the input microdata. These are read-in with the command
% \verb read.csv . For the purposes of understanding how the model works,
% the dataset is read line by line, following the
% example above. The following code creates example datasets,
% based on the same hypothetical survey of 5 individuals described above,
% and 5 small areas. The spatial microsimulation model will select individuals
% based on age and sex and mode of transport (mode of transport
% is also used on the larger online example described in footnote \ref{fnrpub}).
% For consistency with the (larger) model used for the paper,
% the individual level data will be referred to as USd (Understanding Society dataset)
% and the geographic data as all.msim (for all constraint variables).
% The code to read-in the individual level data are presented in code sample \ref{cusd}.
% When called, the data are then displayed as a table (see listing \ref{cout}).
% \begin{lstlisting}[float=h, caption={Manual input of individual level data
% in R}, label=cusd]
% # Read in the data in long form (normaly read.table() used)
% c.names <- c("id", "age", "sex")
% USd <- c( 1, 59, "m",
% 2, 54, "m",
% 3, 35, "m",
% 4, 73, "f",
% 5, 49, "f")
% USd <- matrix(USd, nrow = 5, byrow = T) # Long data into matrix
% USd <- data.frame(USd) # Convert this into a dataframe
% names(USd) <- c.names # Add correct column names
% USd$age <- as.numeric(levels(USd$age)[USd$age]) # Age is a numeric
% \end{lstlisting}
% \begin{lstlisting}[float=h, caption={Output of the USd data frame}, label=cout]
% USd # Show the data frame in R
% ## id age sex
% ## 1 1 59 m
% ## 2 2 54 m
% ## 3 3 35 m
% ## 4 4 73 f
% ## 5 5 49 f
% \end{lstlisting}
% The same procedure applies to the geographical data (listing \ref{cgeo}).
% \begin{lstlisting}[float=h*, caption={Geographic data input}, label=cgeo]
% category.labels <- c("16-49", "50+" # Age constraint
% ,"m", "f" # Sex constraint
% # more constraints could go here
% )
% all.msim <- c( 8, 4, 6, 6, # Original aggregate data
% 2, 8, 4, 6, # Elderly
% 7, 4, 3, 8, # Female dominated
% 5, 4, 7, 2, # Male dominated
% 7, 3, 6, 4 # Young
% )
% all.msim <- matrix(all.msim, nrow = 5, byrow = T)
% all.msim <- data.frame(all.msim) # Convert to dataframe
% names(all.msim) <- category.labels # Add correct column names
% \end{lstlisting}
% IPF relies on the assumption that all constraint variables will contain the
% same number of people. This is logical (how can there be more people classified
% by age than by sex?) but can cause problems for constraint variables that use
% only a subset of the total population, such as those who responded to questions on
% travel to work. To overcome this problem, it is possible to normalise the
% constraint variables, setting the total for each to the one that has the most
% reliable total population. This worked example simply checks whether
% or not they are (listing \ref{ccheck}).
% \begin{lstlisting}[float=h, caption={R code to check the constrain populations
% match}, label=ccheck]
% # Check totals for each constraint match
% rowSums(all.msim[,1:2]) # Age constraint
% ## [1] 12 10 11 9 10
% rowSums(all.msim[,3:4]) # Sex constraint
% ## [1] 12 10 11 9 10
%
% rowSums(all.msim[,1:2]) == rowSums(all.msim[,3:4])
% ## [1] TRUE TRUE TRUE TRUE TRUE
% \end{lstlisting}
Using the subsetting notation just introduced, we must now subset the constraint (\verb cons ) data frame so we can access each constraint (age or sex) separately. We select columns 1 to 2, and 3 to 4, leaving all rows in place. Give these subsets (constraints) the names \verb con1 and \verb con2 respectively:
\begin{lstlisting}[float=h, caption={Subsetting the constraint file}]
con1 <- cons[ ,1:2]
con2 <- cons[ ,3:4]
\end{lstlisting}
To make the individual level dataset comparable to the aggregate level we must
first categorise it and count the number in each category. This job is
performed by the R \emph{script} categorise.R, contained in the `simple' data
folder. Run it by entering
\verb source("data/simple/categorise.R") .\footnote{Note
that the
`categorise.R' script must be modified each time it is used with a different
dataset.}
This results in the creation of a
new object called \texttt{ind.cat} in R, which is a matrix of 0s and
1s with the same number of columns as the constraint data --- try
exploring \texttt{ind.cat} using the square bracket notation mentioned above.
To check that the script has worked
properly, let's count the number of individuals it contains:
\begin{lstlisting}[float=h, caption={Output from ind.cat}]
colSums(ind.cat)
## a16.49 a50. m f
## 2 3 3 2
\end{lstlisting}
The sum of both the age and sex variables is 5 (the total number of individuals): it
worked! Now that the data are in the same form as the constraint variables and we have
checked the data make sense, we can create the `weight' matrix and begin
reweighting the individuals to zones (\cref{lscreate}).
\begin{lstlisting}[float=h, caption={Creating the weight and aggregated
individual level matrices --- see lines 21 and 22 in `simple.R'}, label=lscreate]
weights <- array(1, dim=c(nrow(ind),nrow(cons)))
ind.agg <- matrix(rep(colSums(ind.cat), nrow(cons)), nrow(cons) )
\end{lstlisting}
The code in \cref{lscreate} may not make total sense yet. Do not worry about this for now; you can always
check what is going on later by consulting R's documentation (e.g. try entering
\verb ?rep ) or
by pressing the tab key when entering arguments for R commands in RStudio. For the time
being, it is best to press on with the example and understand the concepts ---
the details can be looked at later.
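Before moving on, it may help to see roughly what `categorise.R' does. The
following is a minimal sketch of the kind of recoding involved (a simplification
for the simple example, not the contents of the script itself): each individual
receives a 1 in the columns matching their age band and sex, and 0 elsewhere.
\begin{lstlisting}[float=h, caption={A minimal sketch of the recoding performed by `categorise.R' (illustrative, not the actual script)}]
ind.cat <- matrix(0, nrow = nrow(ind), ncol = ncol(cons))
colnames(ind.cat) <- c("a16.49", "a50.", "m", "f")
ind.cat[, "a16.49"] <- as.numeric(ind$age < 50)  # flag the under-50s
ind.cat[, "a50."]   <- as.numeric(ind$age >= 50) # flag the over-50s
ind.cat[, "m"]      <- as.numeric(ind$sex == "m")
ind.cat[, "f"]      <- as.numeric(ind$sex == "f")
\end{lstlisting}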
\subsection{Reweighting by the age constraint}
With all of these objects in place, we are ready to begin allocating new weights
to the individuals via IPF. The IPF formula presented in \cref{ipfeq} is written
in code below. It begins with a couple of \emph{for loops}, one to iterate through
each zone (hence the \texttt{1:nrow(cons)} argument, which means ``from
1 to the number of zones in the constraint data'') and one to iterate
through each category within constraint 1 (under 50 and over 50 in this case).
Before running this code, ensure that all previous commands in the `simple.R'
script have been performed.
\begin{lstlisting}[float=h, caption={Calculating new weights for the individuals
based on constraint 1 (age) via IPF --- see line 30 and beyond in `simple.R'}]
for (j in 1:nrow(cons)){
for(i in 1:ncol(con1)){
weights[which(ind.cat[,i] == 1),j] <- con1[j,i] / ind.agg[j,i]}}
\end{lstlisting}
The above code updates the weight matrix by dividing
the census constraint (\texttt{con1}) by the aggregated version
of the individual level data, just as in \cref{ipfeq} above.
Further explanation is provided in \cref{sipfcake}; for now let's push on
to the next block of code, which updates the aggregated data (named
\texttt{ind.agg} in R) based on the new weights.
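Before doing so, it is worth tying the code back to the hand-worked example.
Printing the first column of the weight matrix should reproduce the $w(2)$
values from \cref{t:new-weights} (the output below is what you should expect,
assuming the preceding code has been run):
\begin{lstlisting}[float=h, caption={Checking the new weights for zone 1 against \cref{t:new-weights}}]
weights[, 1] # weights for zone 1 after the age constraint
## [1] 1.333333 1.333333 4.000000 1.333333 4.000000
\end{lstlisting}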
\subsection{Re-aggregation}
\begin{lstlisting}[float=h, caption={Re-aggregation of the individual
level data based on the new weights (see lines 35 and 36 in `simple.R')}]
for (i in 1:nrow(cons)){ # convert con1 weights back into aggregates
ind.agg[i,] <- colSums(ind.cat * weights[,i])}
\end{lstlisting}
The above code multiplies the new weights by the
flat (0 and 1) version of the individual-level data, one zone (\texttt{i}) at a
time. This is equivalent to the re-aggregation stage described in
\cref{sreag} and the result can be checked by printing the first
row of our new \texttt{ind.agg} object:
\begin{lstlisting}[float=h, caption={The new aggregate (marginal) totals for each category
after weights have been constrained by age. Compare with \cref{t:m2} above.}]
ind.agg[1, ]
## [1] 8.000000 4.000000 6.666667 5.333333
\end{lstlisting}
To recap, we have updated `ind.agg' with new values
that are constrained by the age variable. This seems to improve the fit
between the census and individual-level data as the marginal totals for
age are now equal to the aggregate values for age of zone 1.
This is further
verified in the `simple.R'
script file with a line of code that calculates the Total Absolute Error
after each iteration --- \verb sum(abs(ind.agg - cons)) .
Check that the error decreases with each iteration (see \cref{svalid} for
more on model checking and validation).
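As a sketch of how this check might be wrapped up for repeated use (the function
name here is ours, for illustration --- it is not part of the course scripts):
\begin{lstlisting}[float=h, caption={A small helper function for total absolute error (TAE); illustrative only}]
tae <- function(sim, obs) sum(abs(sim - obs)) # total absolute error
tae(ind.agg, cons) # should shrink after each constraint and iteration
\end{lstlisting}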
% At this stage you should switch to running the code in `simple.R'. In RStudio
% to run a line of code from the top left window, simply press \verb Ctl-Enter .
Continue to enter the code contained in `simple.R' line by line to constrain
by the second constraint, sex.
Notice that we have done in code what was previously done by
hand and in a spreadsheet.
Depending on which method you feel most comfortable with, it is worth
spending time repeating the previous steps to ensure they make sense.
The other advantage of running the same process 3 times in 3 different ways
is that we can cross-compare the results from each.
Compare
the following listing with \cref{finalout}, for example: are the weights the same?
\begin{lstlisting}[float=h, caption={W(3) after constraining by age and sex in R.
Compare with \cref{finalout}}]
weights3[,1] # check the weights allocated for zone 1
## [1] 1.2 1.2 3.6 1.5 4.5
\end{lstlisting}
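They should be: the fractions in \cref{finalout} are exactly these decimals.
A quick programmatic check (a one-line sketch):
\begin{lstlisting}[float=h, caption={Confirming the R result matches the hand-calculated weights in \cref{finalout}}]
all.equal(weights3[, 1], c(6/5, 6/5, 18/5, 3/2, 9/2)) # should return TRUE
\end{lstlisting}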
% \newpage
\subsection{Iterations}
In the above example, we have seen the `bare bones' of spatial microsimulation
using IPF to generate weights from which sample populations can be created.
To perform multiple iterations of the same model, we have prepared a slightly
more complex script than `simple.R' called `simple-iterations.R' that can
also be found in the project's root directory. We set the number
of iterations manually (just
change the value of \verb num.its ). This works by calling another
script called `e2.R', which runs the re-weighting process
repeatedly, once for each iteration. Notice that after only 2 full
iterations, the perfect solution is reached ($r = 1$).
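The core of `simple-iterations.R' boils down to a loop of roughly this shape
(a sketch of the idea, not the script verbatim):
\begin{lstlisting}[float=h, caption={The essence of running multiple IPF iterations (sketch; see `simple-iterations.R' for the real code)}]
num.its <- 3 # number of complete IPF iterations to perform
for (it in 1:num.its) {
  source("e2.R") # reweight by every constraint, then re-aggregate
}
\end{lstlisting}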
There is great scope for taking the analysis further:
some further tests and plots are presented on the on-line
versions of this section. The simplest case is contained in
Rpubs document \href{http://rpubs.com/RobinLovelace/6193}{rpubs.com/RobinLovelace/6193}
and a more complex case (with three constraints) can be found in Rpubs document
\href{http://rpubs.com/RobinLovelace/5089}{5089}. For now, however,
we progress to a more complex example, CakeMap.
\section{CakeMap: A worked example}
In this section we will process real data to arrive at an important result:
an estimate of the amount of cake eaten in different wards in West Yorkshire. The
example is deliberately rather absurd: hopefully this will make
the technique more memorable. An attempt has been made to present the method
in as generalisable a way as possible, allowing users to apply the technique described
here to their own data.
The code needed to run the main part of the example is contained within `cMap.R'.
Note that this script makes frequent reference to files contained in the `cakeMap'
sub-folder of `data'. \verb read.csv("data/cakeMap/ind.csv") , for example, is used to read
the individual level data into R.
Because this example uses real world data (albeit in anonymised form),
we will see how important the process of `data cleaning' is.
Rarely are datasets downloaded in exactly the right form for them to
be pumped directly into a spatial microsimulation model. For that reason
we describe the process of loading and preparing the input data first:
similar steps will be needed to use your own data as an input into a
spatial microsimulation model.
\subsection{Loading and preparing the input data}
The raw constraint variables for CakeMap were downloaded from
the Infuse website (\href{http://infuse.mimas.ac.uk/}{infuse.mimas.ac.uk/}).
These, logically enough, are stored in the `data/cakeMap/' directory
as .csv files and contain the word `raw' so it is clear which files
contain the raw data. The file `age-sex-raw.csv', for example, is the raw
age and sex data that was downloaded. As the screenshot in \cref{fraw} shows,
these datasets need to be
processed before they can be used as an input
to our model. `con1.csv' stands for
`constraint 1' and represents this age/sex data after it has been processed.
\begin{figure}
\includegraphics[width=12cm]{raw-data-screenshot}
\centering
\caption{The raw input data for the CakeMap model, downloaded from the InFuse website.}
\label{fraw}
\end{figure}
To ensure reproducibility in the process of converting the raw data into
this `spatial microsimulation ready' dataset, all of the steps used to
clean and rearrange the data have been saved. Take a look at the R script file
`process-age.R' --- these are the kinds of steps that the researcher will need to
undertake before performing a spatial microsimulation model.
Here is not the place to delve into the details of data reformatting;
there are resources dedicated to that \citep{tidy-data, kabacoff2011r}.
However, it is worth taking a look at the `process-age.R' script and the
other `process*' files to see how R can be used to quickly
process complex raw data into a form that is
ready for use in spatial microsimulation.
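As a taster of what such scripts involve, the block below is a generic sketch of
typical cleaning steps (it is not the contents of `process-age.R' itself, and
the column indices are hypothetical):
\begin{lstlisting}[float=h, caption={Generic data-cleaning steps of the kind found in the `process*' scripts (illustrative sketch)}]
raw <- read.csv("data/cakeMap/age-sex-raw.csv") # the raw download
con1 <- raw[, -(1:2)] # drop label columns (indices are hypothetical)
names(con1) <- gsub("X", "", names(con1)) # tidy auto-generated names
write.csv(con1, "data/cakeMap/con1.csv", row.names = FALSE)
\end{lstlisting}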
The input data generated through this process of data preparation are named
`con1.csv' to `con3.csv'. For simplicity, all these were merged into a single
dataset called `cons.csv'. All the input data for this section
are loaded with the following commands:
\begin{lstlisting}[float=h, caption={Loading the input data for CakeMap (see `cMap.R')}]
ind <- read.csv("data/cakeMap/ind.csv")
cons <- read.csv("data/cakeMap/cons.csv")
\end{lstlisting}
Take a look at these input data using the techniques learned in the previous
section. To test your understanding, try to answer the following questions:
what are the constraint variables? How many individuals are in the survey microdataset?
How many zones will we generate spatial microdata for?
For bonus points that will test your R skills as well as your practical knowledge
of spatial microsimulation, try constructing queries in R that will automatically
answer these questions.
It is vital to understand the input datasets before trying to model them, so
take some time exploring the input. Only when satisfied with your understanding of
the datasets (a pen and paper can help here, as well as R!) is it time
to move on to generate the spatial microdata using IPF.
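If you get stuck on the questions above, one-line queries along the following
lines (a sketch, assuming the objects loaded in the previous listing) will
answer the first of them:
\begin{lstlisting}[float=h, caption={Example queries for exploring the CakeMap input data}]
nrow(ind)   # number of individuals in the survey microdataset
nrow(cons)  # number of zones to generate spatial microdata for
names(cons) # category names, hinting at the constraint variables
\end{lstlisting}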
\subsection{Performing IPF on the CakeMap data}
\label{sipfcake}
The R script used to perform IPF on the CakeMap data is almost identical to
that used to perform the process on the simple example in the previous section.
The main difference is that there are more constraints (3 not 2) but otherwise
the code is very similar. Confirm this for yourself: it will be instructive
to compare the `simple.R' and
`\href{https://github.com/Robinlovelace/smsim-course/blob/master/cMap.R}{cMap.R}'
script files. Note that
the former works on a tiny example of 5 input individuals whereas the latter
calculates weights for 916 individuals. The code is generalisable, meaning
that the IPF process would still work given slightly different inputs.
The script would still work if more individuals or zones were added, although
the code would need to be adjusted to accept different constraints.
To ensure the process works with different
constraints, the file `categorise.R' must also be modified.
Note that `cMap.R' contains 100 lines --- longer than the `simple.R' file.
The additional length is due
to the third constraint, additional iterations (see line 66 onwards) and some
preliminary analysis. The best way to understand what is happening in this
script is simply to read through it and run chunks of it in R (remember the
\verb Ctrl-Enter ~ shortcut) to see what happens. Try to `think like a computer'
and imagine what each line of code is doing. This is inevitably difficult
at first so it is recommended that users add more comments to lines
that cause confusion to help remember what is going on. To help this process,
the meaning of some of the trickier and more important lines is explained below.
One potentially confusing area is the representation of the \texttt{weights} object,
used to store the weights after each constraint. To understand what is happening,
try entering this code into R and seeing what happens:
\begin{lstlisting}[float=h, caption={Creating the initial weight matrix --- see lines
28 to 30 in cMap.R}]
# create weights in 3D matrix (individuals, areas, iteration)
weights <- array(dim=c(nrow(ind),nrow(cons),num.cons+1))
weights[,,num.cons+1][] <- 1 # sets initial weights to 1
ini.ws <- weights[,,num.cons+1]
\end{lstlisting}
In the above code, \texttt{weights} is created by the \texttt{array}
command, which in fact creates many 2 dimensional matrices in a
`data cube'. This is different from the weight object used in the
`simple.R' script, which had only 2 dimensions.
If this is confusing, just remember that
the `square' weight matrix --- where each column represents a zone
and each row represents an individual --- can be returned to by
specifying the third dimension. Dimensions in R are specified by
the square brackets, so the initial weights can be called by typing
\texttt{weights[,,4]}. This should result in hundreds of 1s, the initial weights.
To print a more manageable-sized output, remember how to subset the other
dimensions: \texttt{weights[1:3,1:5,4]}, for example, will print the first
3 rows (individuals) and first 5 columns of the 4th matrix. Why do it this way?
Because a single weights object stores all the previous sets of weights,
letting us go back and check whether the model fit improves from one constraint
and one iteration to the next.
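The following commands, run after the listing above, make the array's structure
concrete (a sketch; the exact outputs depend on your data):
\begin{lstlisting}[float=h, caption={Exploring the three-dimensional weights array}]
dim(weights)          # individuals x zones x (num.cons + 1)
weights[1:3, 1:5, 4]  # 3 individuals, 5 zones, the initial (all 1) weights
\end{lstlisting}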
Now, when the new weights are set after iteration 1, they are saved as follows,
within a nested \emph{for loop} that iterates over all zones (j) and
constraint categories (i):
\begin{lstlisting}[float=h, caption={Saving the new weights after the first
constraint of IPF --- see line 43 in `cMap.R'}, label=cmapipf]
weights[which(ind.cat[,i] == 1),j,1] <- con1[j,i] / ind.agg[j,i,1]}}