%% This is file `elsarticle-template-2-harv.tex',
%%
%% Copyright 2009 Elsevier Ltd
%%
%% This file is part of the 'Elsarticle Bundle'.
%% ---------------------------------------------
%%
%% It may be distributed under the conditions of the LaTeX Project Public
%% License, either version 1.2 of this license or (at your option) any
%% later version. The latest version of this license is in
%% http://www.latex-project.org/lppl.txt
%% and version 1.2 or later is part of all distributions of LaTeX
%% version 1999/12/01 or later.
%%
%% The list of all files belonging to the 'Elsarticle Bundle' is
%% given in the file `manifest.txt'.
%%
%% Template article for Elsevier's document class `elsarticle'
%% with harvard style bibliographic references
%%
%% $Id: elsarticle-template-2-harv.tex 155 2009-10-08 05:35:05Z rishi $
%% $URL: http://lenova.river-valley.com/svn/elsbst/trunk/elsarticle-template-2-harv.tex $
%%
%%\documentclass[preprint,authoryear,12pt]{elsarticle}
%% Use the option review to obtain double line spacing
%% \documentclass[authoryear,preprint,review,12pt]{elsarticle}
%% Use the options 1p,twocolumn; 3p; 3p,twocolumn; 5p; or 5p,twocolumn
%% for a journal layout:
%% Astronomy & Computing uses 5p
%% \documentclass[final,authoryear,5p,times]{elsarticle}
\documentclass[final,authoryear,5p,times,twocolumn]{elsarticle}
%% if you use PostScript figures in your article
%% use the graphics package for simple commands
%% \usepackage{graphics}
%% or use the graphicx package for more complicated commands
\usepackage{graphicx}
%% or use the epsfig package if you prefer to use the old commands
%% \usepackage{epsfig}
%% The amssymb package provides various useful mathematical symbols
\usepackage{amssymb}
%% The amsthm package provides extended theorem environments
%% \usepackage{amsthm}
\usepackage[pdftex,pdfpagemode={UseOutlines},bookmarks,bookmarksopen,colorlinks,linkcolor={blue},citecolor={green},urlcolor={red}]{hyperref}
\usepackage{hypernat}
%% Alternatives to hyperref for testing
%\usepackage{url}
%\newcommand{\htmladdnormallinkfoot}[2]{#1\footnote{\texttt{#2}}}
%\newcommand{\htmladdnormallink}[1]{\texttt{#1}}
%\newcommand{\href}[2]{\texttt{#2}}
%% The lineno packages adds line numbers. Start line numbering with
%% \begin{linenumbers}, end it with \end{linenumbers}. Or switch it on
%% for the whole article with \linenumbers after \end{frontmatter}.
%% \usepackage{lineno}
%% natbib.sty is loaded by default. However, natbib options can be
%% provided with \biboptions{...} command. Following options are
%% valid:
%% round - round parentheses are used (default)
%% square - square brackets are used [option]
%% curly - curly braces are used {option}
%% angle - angle brackets are used <option>
%% semicolon - multiple citations separated by semi-colon (default)
%% colon - same as semicolon, an earlier confusion
%% comma - separated by comma
%% authoryear - selects author-year citations (default)
%% numbers- selects numerical citations
%% super - numerical citations as superscripts
%% sort - sorts multiple citations according to order in ref. list
%% sort&compress - like sort, but also compresses numerical citations
%% compress - compresses without sorting
%% longnamesfirst - makes first citation full author list
%%
%% \biboptions{longnamesfirst,comma}
% \biboptions{}
\journal{Astronomy \& Computing}
%% Make single quotes look right in verbatim mode
\usepackage{upquote}
\usepackage{upgreek}
\usepackage{color}
% Aim to be consistent, and correct, about how we refer to sections
\newcommand*\secref[1]{Sect.~\ref{#1}}
\newcommand*\appref[1]{\ref{#1}}
\begin{document}
\begin{frontmatter}
%% Title, authors and addresses
%% use the tnoteref command within \title for footnotes;
%% use the tnotetext command for the associated footnote;
%% use the fnref command within \author or \address for footnotes;
%% use the fntext command for the associated footnote;
%% use the corref command within \author for corresponding author footnotes;
%% use the cortext command for the associated footnote;
%% use the ead command for the email address,
%% and the form \ead[url] for the home page:
%%
%% \title{Title\tnoteref{label1}}
%% \tnotetext[label1]{}
%% \author{Name\corref{cor1}\fnref{label2}}
%% \ead{email address}
%% \ead[url]{home page}
%% \fntext[label2]{}
%% \cortext[cor1]{}
%% \address{Address\fnref{label3}}
%% \fntext[label3]{}
\title{ORAC-DR: A generic data reduction pipeline infrastructure}
%% use optional labels to link authors explicitly to addresses:
%% \author[label1,label2]{<author name>}
%% \address[label1]{<address>}
%% \address[label2]{<address>}
\author[jac]{Tim Jenness\corref{cor1}\fnref{timj}}
\ead{[email protected]}
\author[jac]{Frossie Economou\fnref{fe}}
\cortext[cor1]{Corresponding author}
\fntext[timj]{Present address: Department of Astronomy, Cornell University, Ithaca,
NY 14853, USA}
\fntext[fe]{Present address: LSST Project Office, 933 N.\ Cherry Ave, Tucson, AZ 85721, USA}
\address[jac]{Joint Astronomy Centre, 660 N.\ A`oh\=ok\=u Place, Hilo, HI
96720, USA}
\begin{abstract}
%% Text of abstract
ORAC-DR is a general purpose data reduction pipeline system designed
to be instrument and observatory agnostic. The pipeline works with
instruments as varied as infrared integral field units, imaging
arrays and spectrographs, and sub-millimeter heterodyne arrays \&
continuum cameras. This paper describes the architecture of the
pipeline system and the implementation of the core
infrastructure. We finish by discussing the lessons learned since
the initial deployment of the pipeline system in the late 1990s.
\end{abstract}
\begin{keyword}
%% keywords here, in the form: keyword \sep keyword
%% MSC codes here, in the form: \MSC code \sep code
%% or \MSC[2008] code \sep code (2000 is the default)
data reduction pipelines \sep techniques: miscellaneous \sep methods:
data analysis
\end{keyword}
\end{frontmatter}
% \linenumbers
%% Journal abbreviations
\newcommand{\mnras}{MNRAS}
\newcommand{\aap}{A\&A}
\newcommand{\aaps}{A\&AS}
\newcommand{\pasp}{PASP}
\newcommand{\apj}{ApJ}
\newcommand{\apjs}{ApJS}
\newcommand{\qjras}{QJRAS}
\newcommand{\an}{Astron.\ Nach.}
\newcommand{\ijimw}{Int.\ J.\ Infrared \& Millimeter Waves}
\newcommand{\procspie}{Proc.\ SPIE}
\newcommand{\aspconf}{ASP Conf. Ser.}
%% Applications
%% Misc
\newcommand{\recipe}{\emph{Recipe}}
\newcommand{\recipes}{\emph{Recipes}}
\newcommand{\primitive}{\emph{Primitive}}
\newcommand{\primitives}{\emph{Primitives}}
\newcommand{\Frame}{\emph{Frame}}
\newcommand{\Group}{\emph{Group}}
\newcommand{\Index}{\emph{index}}
\newcommand{\oracdr}{\textsc{orac-dr}}
\newcommand{\cgsdr}{\textsc{cgs}{\footnotesize 4}\textsc{dr}}
%% Links
\newcommand{\ascl}[1]{\href{http://www.ascl.net/#1}{ascl:#1}}
%% main text
\section{Introduction}
In the early 1990s each instrument delivered to the United Kingdom
Infrared Telescope (UKIRT) and the James Clerk Maxwell Telescope (JCMT) came
with its own distinct data reduction system that reused very little
code from previous instruments. In part this was due to the rapid
change in hardware and software technologies during the period, but it
was also driven by the instrument projects being delivered
by independent project teams with no standardization requirements
being imposed by the observatory. The observatories were required to
support the delivered code and as operations budgets shrank the need
to use a single infrastructure became more apparent.
\cgsdr\
\citep[][\ascl{1406.013}]{1992ASPC...25..479S,1996ASPC...87..223D} was
the archetypal instrument-specific on-line data reduction system at
UKIRT. The move from VMS to UNIX in the acquisition environment coupled
with plans for rapid instrument development of UFTI
\citep{2003SPIE.4841..901R}, MICHELLE \citep{1993ASPC...41..401G} and
UIST \citep{2004SPIE.5492.1160R}, led to a decision to revamp the
pipeline infrastructure at UKIRT \citep{1998ASPC..145..196E}. In the
same time period the SCUBA instrument \citep{1999MNRAS.303..659H} was
being delivered to the JCMT. SCUBA had an on-line data reduction
system developed on VMS that was difficult to modify and ultimately
provided only simple quick-look functionality. There was no explicit
data reduction pipeline, and this provided the opportunity to develop a
truly instrument agnostic pipeline capable of supporting different
imaging modes and wavelength regimes.
The Observatory Reduction and Acquisition Control Data Reduction pipeline
\citep[\oracdr;][\ascl{1310.001}]{1999ASPC..172...11E,2008AN....329..295C} was
the resulting system. In the sections that follow we present an
overview of the architectural design and then describe the pipeline
implementation. We finish by detailing lessons learned during the
lifetime of the project.
\section{Architecture}
The general architecture of the \oracdr\ system has been described
elsewhere \citep{1999ASPC..172...11E,2008AN....329..295C}. To
summarize, the system is split into discrete units with well-defined
interfaces. The recipes define the required processing steps in
abstract language, with no obvious software code. These recipes
are expanded into executable code by a parser, and this code is
executed with the current state of the input data file objects and
calibration system. The recipes call out to external
packages\footnote{These are known as ``algorithm engines'' in the
ORAC-DR documentation.} using a standardized calling interface and it is these
applications that contain the detailed knowledge of how to process pixel
data. In all the currently supported instruments the external algorithm
code is from the Starlink software collection
\citep[][\ascl{1110.012}]{2014ASPC..485..391C} and uses the ADAM
messaging system \citep{1992ASPC...25..126A}, but this is not
required by the \oracdr\ design. There was a deliberate decision to
separate the core pipeline functionality from the high-performance
data processing applications so that the pipeline was not locked into
a single application infrastructure.
A key part of the architecture is that the pipeline can function
entirely in a data-driven manner. All information required to reduce
the data correctly must be available in the metadata of the input data
files. This requires a systems engineering approach to observatory
operations where the metadata are treated as equal to the science
pixel data \citep[see e.g.,][for an overview of the JCMT and UKIRT
approach]{2011tfa..confE..42J} and all observing modes are designed
with observation preparation and data reduction in mind. An overview
of the pipeline process is shown in Fig.~\ref{fig:flow}.
\begin{figure*}
\includegraphics[width=\textwidth]{oracdr-flow}
\caption{Outline of the control flow in ORAC-DR for a single
observation. For multiple observations the pipeline will either
check for more data at the end of the \recipe\ execution (on-line
mode) or read all files and do group assignments before looping
over groups (batch mode).}
\label{fig:flow}
\end{figure*}
\section{Implementation}
In this section we discuss the core components of the pipeline
infrastructure. The algorithms themselves are pluggable parts of the
architecture and are not considered further. The only requirement is
that the algorithm code be callable either directly from Perl or over
a messaging interface supported by Perl.
\subsection{Data Detection}
The first step in reducing data is determining which data should be
processed. \oracdr\ separates data detection from pipeline processing,
allowing for a number of different schemes for locating files. In
on-line mode the pipeline is set up to assume an incremental delivery
of data throughout the period the pipeline is running. Here we
describe the most commonly-used options.
\subsubsection{Flag files}
The initial default scheme was to check whether a new file with the
expected naming convention had appeared on disk. Whilst this can work
if the appearance of the data file is instantaneous (for example, it
is written to a temporary location and then renamed), it is all too
easy to attempt to read a file that is being written to. Modifying
legacy acquisition systems to do atomic file renames proved to be
difficult and instead a ``flag'' file system was used.
A flag file was historically a zero-length file created as soon as the
observation was completed and the raw data file was closed. The
pipeline would look for the appearance of the flag file (it would be
able to use a heuristic to predict the name of the file in advance and
also look a few observations ahead in case the acquisition system had
crashed) and use that to trigger processing of the primary data file.
As more complex instruments arrived capable of writing multiple files
for a single observation (either in parallel
\citep[SCUBA-2;][]{2013MNRAS.430.2513H} or sequentially
\citep[ACSIS;][]{2009MNRAS.399.1026B}) the flag system was
modified so that the pipeline monitors a single flag file that stores
the names of the relevant data files (one file per line). For the
instruments writing files sequentially the pipeline is able to
determine which new files have been added to the flag file since the
previous check.
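For illustration, the incremental flag-file check can be sketched in a
few lines of Perl; the subroutine name and the bookkeeping hash are
illustrative and are not part of the pipeline's actual interface:
{\small
\begin{verbatim}
# Return the data files added to a flag file since the last check.
# %seen maps a flag-file name to the number of lines already read.
my %seen;
sub new_files_from_flag {
    my ($flagfile) = @_;
    return () unless -e $flagfile;          # flag not yet written
    open my $fh, '<', $flagfile or return ();
    chomp( my @lines = <$fh> );
    close $fh;
    my $already = $seen{$flagfile} || 0;
    $seen{$flagfile} = scalar @lines;
    return @lines[ $already .. $#lines ];   # only the new entries
}
\end{verbatim}
}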
Historically, synchronization delays over NFS mounts caused
difficulties: the flag file would appear before the actual data file
was visible to the NFS client computer, but on modern systems this
behavior no longer occurs. Modern file event notification schemes
(such as \texttt{inotify} on Linux) do not generally help with the
data detection problem since, in the current setup, the data reduction
pipelines always mount the data disks from the acquisition computer
over NFS. A more robust solution is to implement a publish/subscribe
system whereby the pipeline monitors the acquisition computers for new
data. Such a scheme is discussed in the next section.
\subsubsection{Parameter monitoring}
The SCUBA-2 quick look pipeline \citep{2005ASPC..347..585G} had a
requirement to be able to detect files taken at a rate of
approximately 1\,Hz for stare observations. This was impractical with
a single-threaded data detection system embedded in the pipeline
process and relying on the file system. Therefore, for SCUBA-2 quick-look processing the
pipeline uses a separate process that continually monitors the
four data acquisition computers using the DRAMA messaging system
\citep{1995SPIE.2479...62B}. When all four sub-arrays indicate that a
matched dataset is available the monitored data are written to disk
and a flag file created. Since these data are ephemeral there is a
slight change to flag file behavior in that the pipeline will take
ownership of data it finds by renaming the flag file. If that happens
the pipeline will be responsible for cleaning up; whereas if the
pipeline does not handle the data before the next quick look image
arrives the gathering process will remove the flag file and delete the
data before making the new data available.
\subsection{File format conversion}
Once files have been found they are first sent to the format
conversion library. The instrument infrastructure defines what the
external format of each file is expected to be and also the internal format
expected by the reduction system. The format conversion system knows
how to convert the files to the necessary form. This does not always
involve a change in low level format (such as FITS to NDF) but can
handle changes to instrument acquisition systems such as converting
HDS files spread across header and exposure files into a single HDS
container matching the modern UKIRT layout.
\subsection{Recipe Parser}
A \recipe\ is the top-level view of the data processing steps
required to reduce some data. The requirements were that the recipe
should be easily editable by an instrument scientist without having to
understand the code, the \recipe\ should be easily understandable by
using plain language, and it should be possible to reorganize steps
easily. Furthermore, there was a need to allow \recipes\ to be edited
``on the fly'' without having to restart the pipeline. The next data file
to be picked up would be processed using the modified version of the
\recipe\ and this is very important during instrument commissioning. An
example, simplified, imaging \recipe\ is shown in Fig.\
\ref{fig:recipe}. Each of these steps can be given parameters to
modify their behavior. The expectation was that these \recipes\ would
be loadable into a Recipe Editor GUI tool, although such a tool was
never implemented.
\begin{figure}
{
\small
\begin{verbatim}
_SUBTRACT_DARK_
_DIVIDE_BY_FLAT_
_BIAS_CORRECT_GROUP_
_APPLY_DISTORTION_TRANSFORMATION_
_GENERATE_OFFSETS_JITTER_
_MAKE_MOSAIC_ FILLBAD=1 RESAMPLE=1
\end{verbatim}
}
\caption{A simplified imaging \recipe. Note that the individual steps
make sense scientifically and it is clear how to change the order or
remove steps. The \texttt{\_MAKE\_MOSAIC\_} step includes override
parameters.}
\label{fig:recipe}
\end{figure}
Each of the steps in a \recipe\ is known as a
\primitive. The \primitives\ contain the Perl source code and can
themselves call other \primitives\ if required.
The parser's core job is to read the \recipe\ and replace each mention
of a \primitive\ with a subroutine call to the source code for that
\primitive. For each
\primitive\ the parser keeps a cache containing the compiled form of
the \primitive\ as a code reference, the modification time associated
with the \primitive\ source file when it was last read, and the full
text of the \primitive\ for debugging purposes. Whenever a \primitive\
code reference is about to be executed the modification time is
checked to decide whether the \primitive\ needs to be re-read.
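The caching logic amounts to a modification-time check before each
use; a minimal sketch (with an assumed \texttt{expand\_and\_compile}
helper standing in for the parser's expansion step) is:
{\small
\begin{verbatim}
# One cache entry per primitive: compiled code ref, source mtime
# and full source text (kept for debugging).
my %cache;
sub compiled_primitive {
    my ($name, $path) = @_;
    my $mtime = (stat $path)[9];
    if ( !exists $cache{$name} || $cache{$name}{mtime} != $mtime ) {
        my $text = do { open my $fh, '<', $path or die $!;
                        local $/; <$fh> };
        my $code = expand_and_compile($text);  # hypothetical helper
        $cache{$name} = { code => $code, mtime => $mtime,
                          text => $text };
    }
    return $cache{$name}{code};
}
\end{verbatim}
}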
The parser is also responsible for adding additional code at the
start of the \primitive\ to allow it to integrate into the general
pipeline infrastructure. This code includes:
\begin{itemize}
\item Handling of state objects that are passed through the subroutine
argument stack and parsing of parameters passed to the \primitive\
by the caller. These arguments are designed not to be
language-specific; they use a simple \texttt{KEYWORD=VALUE} syntax
and cannot be handled directly by the Perl interpreter.
\item Trapping for \primitive\ call recursion.
\item Debugging information
such as timers to allow profile information to be
collected, and entry and exit log messages to indicate exactly when
a routine is in use.
\item Callbacks to GUI code to indicate which \primitive\ is
currently active.
\item Configuring the logging system so that all messages appearing
will be associated with the correct primitive when they are written
to the history blocks (see \secref{sec:prov} for details).
\end{itemize}
The design is such that adding new code to the entry and exit of each
\primitive\ can be done in a few lines with little overhead. In
particular, use is made of the \verb|#line| directive in Perl that
allows for the line number to be manipulated such that error messages
reflect the line number in the original \primitive\ and not the line
number in the expanded \primitive.
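The effect of the \verb|#line| directive can be seen in a two-line
example (the \primitive\ name and line number are arbitrary):
{\small
\begin{verbatim}
# After the directive, run-time errors are reported against the
# original primitive source rather than the expanded recipe.
eval <<'CODE';
#line 10 "_SUBTRACT_DARK_"
die "no dark available";
CODE
print $@;  # no dark available at _SUBTRACT_DARK_ line 10.
\end{verbatim}
}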
Calling external packages is a very common occurrence and is also where
most of the time is spent during \recipe\ execution. In order to
minimize repetitive coding for error conditions and to allow for profiling, calls to
external packages are surrounded by code to automatically handle these
conditions. This allows the programmer to focus on the \recipe\ logic
and not have to understand all the failure modes for a particular
package.\footnote{The \texttt{oracdr\_parse\_recipe} command can be run
to provide a complete translation of a \recipe.} The parser is
designed such that if a particular error code is important (for
example, an error code indicating that a failure was due to there
being too few stars in the image), the automated error handling is
altered whenever the \primitive\ writer explicitly asks to check the
return value from the external application.
\subsection{Recipe Parameters}
The general behavior of a recipe can be controlled by editing it and
adjusting the parameters passed to the \primitives. A much more
flexible scheme is available which allows the person running the
pipeline to specify a \recipe\ configuration file that can be used to
control the behavior of \recipe\ selection and how a \recipe\ behaves.
The configuration file is a text file written in the INI
format. Although it is possible for the \recipe\ to be specified on
the command line, that \recipe\ would then be used for all the files
being reduced in the same batch, which is not an efficient way to
permanently change the \recipe\ name. Changing the file header is not
always possible so the configuration file can be written to allow
per-object selection of \recipes. For example,
\begin{quote}
\begin{verbatim}
[RECIPES_SCIENCE]
OBJECT1=REDUCE_SCIENCE
OBJECT2=REDUCE_FAINT_SOURCE
A.*=BRIGHT_COMPACT
\end{verbatim}
\end{quote}
would select \texttt{REDUCE\_SCIENCE} whenever a \emph{science}
observation of OBJECT1 is encountered but choose
\texttt{REDUCE\_FAINT\_SOURCE} whenever OBJECT2 is found. The third
line is an example of a regular expression that can be used to select
recipes based on a more general pattern match of the object name. This relies
on header translation functioning to find the observation type and
object name correctly. This sort of configuration is quite common when the
Observing Tool has not been set up to switch recipes.
Once a \recipe\ has been selected it can be configured as simple
key-value pairs:
\begin{quote}
\begin{verbatim}
[REDUCE_SCIENCE]
PARAM1 = value1
PARAM2 = value2
[REDUCE_SCIENCE:A.*]
PARAM1 = value3
\end{verbatim}
\end{quote}
and here, again, the parameters selected can be controlled by a
regular expression on the object name. The final set of parameters is
made available to the primitives in a key-value lookup table.
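As an illustration of how such a file might be interpreted, the
following sketch merges the base section with any object-matched
section; it uses the CPAN \texttt{Config::IniFiles} module purely for
convenience and does not represent the pipeline's own configuration
reader:
{\small
\begin{verbatim}
use Config::IniFiles;   # illustrative choice of INI parser

# Collect parameters from [RECIPE] plus any [RECIPE:regex] section
# whose regular expression matches the object name; later sections
# override earlier ones.
sub recipe_params {
    my ($file, $recipe, $object) = @_;
    my $cfg = Config::IniFiles->new( -file => $file ) or return {};
    my %params;
    for my $section ( grep { /^\Q$recipe\E(:|$)/ } $cfg->Sections ) {
        my ($regex) = $section =~ /^\Q$recipe\E:(.+)$/;
        next if defined $regex && $object !~ /$regex/;
        $params{$_} = $cfg->val( $section, $_ )
            for $cfg->Parameters( $section );
    }
    return \%params;
}
\end{verbatim}
}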
\subsection{Recipe Execution}
\label{sec:exec}
Once a set of files has been found the header is read to determine
how the data should be reduced. Files from the same observation are
read into what is known as a \Frame\ object. This object contains all
the metadata and pipeline context and, given that the currently used
applications require files to be written, the name of the
currently active intermediate file (or files for observations that
either consist of multiple files or which generate multiple
intermediate files). In some cases, such as for ACSIS, a single
observation can generate multiple files that are independent and in
these cases multiple \Frame\ objects are created and they are
processed independently. There is also a \Group\ object which
contains the collection of \Frame\ objects that the pipeline should
combine.
The pipeline will have been initialized to expect a particular instrument and
the resulting \Frame\ and \Group\ objects will be instrument-specific subclasses.
The \Frame\ object contains sufficient information to allow the
pipeline to work out which \recipe\ should be used to reduce the
data. The \recipe\ itself is located by looking through a search path
and modifiers can be specified to select recipe variants. For example,
if the recipe would normally be \texttt{REDUCE\_SCIENCE} the pipeline
can be configured to prefer a recipe suffix of \texttt{\_QL} to
enable a quick-look version of a recipe to be selected at the summit
whilst selecting the full recipe when running off-line.
The top-level \recipe\ is parsed and is then evaluated in the
parent pipeline context using the Perl \texttt{eval} function. The
\recipe\ is called with the relevant \Frame\ and \Group\ objects along
with other context. The
reason we use \texttt{eval} rather than running the recipe in a
distinct process is to allow the recipe to update the state. As
discussed in \secref{sec:onvoff}, the pipeline is designed to
function in an incremental mode where data are reduced as they arrive,
with group co-adding either happening incrementally or waiting for a
set cadence to complete. This requires that the group processing stage
knows the current state of the \Group\ object and of the contributing
\Frame\ objects. Launching an external process to execute the
recipe each time new data arrived would significantly complicate the
architecture.
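The execution step itself is conceptually no more than a string
\texttt{eval} with error trapping; in this sketch
\texttt{\$recipe\_code} is the source produced by the parser and the
surrounding names are illustrative:
{\small
\begin{verbatim}
# Run the expanded recipe in the parent process so that the Frame
# and Group objects retain their state between observations.
sub run_recipe {
    my ($recipe_code, $Frm, $Grp) = @_;  # objects visible to the eval
    my $ok = eval $recipe_code;          # string eval, shared context
    if ($@) {
        # A failed recipe must not kill the pipeline; the next
        # observation can still be picked up and processed.
        warn "Recipe failed: $@";
        return 0;
    }
    return 1;
}
\end{verbatim}
}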
As noted in the previous section, the \recipe\ is parsed incrementally
and the decision on whether to re-read a \primitive\ is deferred until
that \primitive\ is required. This is important for instruments such
as MICHELLE and UIST which can observe in multiple modes
(spectroscopy, imaging, IFU), sometimes
requiring a single recipe invocation to call \primitives\ optimized
for the different modes. The execution environment handles this by
allowing a caller to set the instrument mode and this dynamically
adjusts the \primitive\ selection code.
\subsection{Header Translation}
As more instruments were added to \oracdr\ it quickly became apparent
that many of the \primitives\ were being adjusted to support different
variants of FITS headers through the use of repetitive if/then/else
constructs. This was making it harder to support the code and it was
decided to modify the \primitives\ to use standardized headers. When a
new \Frame\ object is created the headers are immediately translated
to standard form and both the original and translated headers are
available to \primitive\ authors.
The code to do the translation was felt to be fairly generic and was
written to be a standalone
module\footnote{\texttt{Astro::FITS::HdrTrans}, available on
CPAN}. Each instrument header maps to a single translation class
with a class hierarchy that allows, for example, JCMT instruments to
inherit knowledge of shared JCMT headers without requiring that the
translations be duplicated. Each class is passed the input header and
reports whether the class can process it, and it is an error for multiple
classes to be able to process a single header. A method exists for each
target generic header where,
for example, the method to calculate the start airmass would be
\texttt{\_to\_AIRMASS\_START}. The simple unit mappings (where there
is a one-to-one mapping of an instrument header to a generic header
without requiring changes to units) are defined as simple Perl lookup tables
but at compile-time the corresponding methods are generated so that
there is no difference in interface for these cases. Complex mappings,
which may involve multiple input FITS headers, are written as explicit
conversion methods.
The header translation system can also reverse the mapping such that a
set of generic headers can be converted back into instrument-specific
form. This can be particularly useful when required to update a header
during processing.
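For example, translating a raw header with the module's
\texttt{translate\_from\_FITS} entry point looks roughly as follows
(the instrument keywords shown are purely illustrative):
{\small
\begin{verbatim}
use Astro::FITS::HdrTrans qw/ translate_from_FITS /;

# %fits holds the raw instrument headers (keyword => value); the
# keywords below are purely illustrative.
my %fits = ( INSTRUME => 'UFTI', AMSTART => 1.25 );

# The matching translation class is selected automatically and a
# hash of generic headers is returned.
my %generic = translate_from_FITS( \%fits );
print $generic{AIRMASS_START}, "\n";
\end{verbatim}
}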
\subsection{Calibration System}
During \Frame\ processing it is necessary to make use of calibration
frames or parameters derived from calibration observations. The early
design focused entirely on how to solve the problem of selecting the
most suitable calibration frame for a particular science observation
without requiring the instrument scientist to write code or understand
the internals of the pipeline. The solution that was adopted involves
two distinct operations: filing calibration results and querying those results.
When a calibration image is reduced (using the same pipeline
environment as science frames) the results of the processing are
registered with the calibration system. Information such as the name
of the file, the wavelength, and the observing mode are all stored in the \Index.
In the current system the \Index\ is a text file on disk that is cached by
the pipeline but the design would be no different if an SQL database
was used instead; no \primitives\ would need to be modified to switch
to an SQL backend. The only requirement is that the \Index\ is
persistent over pipeline restarts (which may happen a lot during
instrument commissioning).
The second half of the problem was to provide a rules-based system.
A calibration rule simply indicates how a header in the science data
must relate to a header in the calibration database in order for the
calibration to be flagged as suitable. The following is an excerpt
from a rules file for an imaging instrument dark calibration:
\begin{quote}
{\small
\begin{verbatim}
OBSTYPE eq 'DARK'
MODE eq $Hdr{MODE}
EXP_TIME == $Hdr{EXP_TIME}
MEANCOUNT
\end{verbatim}
}
\end{quote}
Each row in the rules file is evaluated in turn by replacing the
unadorned keyword with the corresponding calibration value read from
the \Index\ and the \texttt{\$Hdr} corresponding to the science
header. In the above example the
calibration would match if the exposure times and observing readout
mode match and the calibration itself is a dark.
These rules are evaluated using the Perl \texttt{eval} command
so the full Perl interpreter is available. This allows for
complex rules to be generated such as a rule that allows a calibration to expire
if it is too old.
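The evaluation of the rules can be sketched as follows, where
\texttt{\%Hdr} holds the translated science headers and
\texttt{\%Cal} the values stored for one candidate calibration in the
\Index\ (the subroutine and the \texttt{\%Cal} name are illustrative,
not the pipeline's internal API):
{\small
\begin{verbatim}
our (%Hdr, %Cal);   # visible to the eval'd rule expressions
sub calibration_matches {
    my ($rules, $hdr, $cal) = @_;
    local %Hdr = %$hdr;
    local %Cal = %$cal;
    for my $rule (@$rules) {
        next unless $rule =~ /\S/;       # skip blank lines
        next if $rule =~ /^\s*\w+\s*$/;  # schema-only, e.g. MEANCOUNT
        # Replace the leading unadorned keyword with the value
        # recorded for this calibration in the index.
        (my $expr = $rule) =~ s/^\s*(\w+)/\$Cal{$1}/;
        return 0 unless eval $expr;      # every rule must be true
    }
    return 1;
}
\end{verbatim}
}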
The rules file itself represents the schema of the database in
that for every line in the rules file, information from that
calibration is stored in the \Index. In the example above,
\texttt{MEANCOUNT} is not used in the rules processing but the
presence of this item means that the corresponding value will be
extracted from the header of the calibration image and registered in
the calibration database. Once an item is stored in the calibration
database a calibration query will make that value available in
addition to the name of the matching calibration file.
It is therefore simple for the instrument
scientist to add a new header for tracking, although this does require
that the old \Index\ is removed and the data reprocessed to regenerate
a new \Index\ in the correct form.
The calibration selection system can behave differently in off-line
mode as the full set of calibrations can be made available and
calibrations taken after the current observation may be relevant. Each
instrument's calibration class can decide whether this is an
appropriate behavior.
The calibration system can also be modified by a command-line argument at
run time to allow the user to decide which behavior to use. For
example, with the SCUBA pipeline \citep{1999ASPC..172..171J} the user
can decide which opacity calibration scheme they require from a number
of options.
One of the more controversial aspects of the calibration system was
that the UKIRT pipelines would stop and refuse to reduce data if no
suitable calibration frame had been taken previously (such as a dark
taken in the wrong mode or with the wrong exposure). This sometimes
led to people reporting that the pipeline had crashed (and so was
unstable) but the purpose was to force the observer to stop and think
about their observing run and ensure that they did not take many hours
of data with their calibration observations being taken in a manner
incompatible with the science data. A pro-active pipeline helped to
prevent this and also made it easier to support flexible scheduling
\citep{2002ASPC..281..488E,2004SPIE.5493...24A} without fearing that
the data were unreducible.
This hard-line approach to requiring fully calibrated observations,
even if the PI's specific science goals did not require it, was
adopted in anticipation of the emergence of science data archives as
an important source of data for scientific papers. Casting the PI not
as the data owner, but rather as somebody to whom observatory data are
leased from the public domain for the length of the proprietary
period, requires that an observation be regarded as complete only if
it is fully calibratable. In that way, the value of the telescope time
is maximized by making the dataset useful to the widest range of its
potential uses. To this end, the authors favor a model where, for
flexibly-scheduled PI-led facilities, calibration time is not deducted
from the PI's allocation.
\subsection{Provenance Tracking}
\label{sec:prov}
For the outputs from a data reduction pipeline it is important for
astronomers to understand what was done to the data and how they can
reproduce the processing steps. \oracdr\ manages this provenance and
history tracking in a number of different ways. The pipeline makes available to
\primitives\ the commit ID (SHA1) of the pipeline software and the
commit ID of the external application package. It is up to the
\recipe\ to determine whether use should be made of that
information. For the \recipes\ that run at the JCMT Science Archive
\citep{2014Economou} there is code that inserts this information, and
the \recipe\ name, into data headers. Summit processing \recipes\ do
not include this detail as the products are generally thought to be
transient in nature: the \recipes\ are optimized for speed and
quality-assurance tracking rather than absolute data quality. One
caveat in this approach is that an end-user who modifies a
\recipe\ will not see any change as the commit ID will not have
changed. This was thought to be of secondary importance compared to
the major use case of archive processing but does need consideration
before the reproducibility aspects of data reduction can be considered
complete.
Detailed tracking of the individual steps of the processing is
handled differently in that the pipeline is written with the
assumption that the external applications will track provenance and
history themselves. This is true for the Starlink software where the
NDF library, which already supported detailed history tracking, was
updated to also support file provenance so that all ancestor files
could be tracked \citep[see e.g.][for details on the provenance algorithm]{ndfjenness}.
We took this approach because we felt it was far too complicated to
require that the pipeline infrastructure and \primitives\ track what
is being done to the data files. Modifying the file I/O library meant
that provenance tracking would be available to all users of the
external packages (in this case the Starlink software applications)
and not just the pipeline users. The history information automatically logged by the external
applications is augmented by code in the pipeline that logs the
primitive name whenever header information is synchronized to a file,
and, optionally, all text messages that are output by a \primitive\ can be
stored as history items in the files written by the \primitive.
\subsection{Configurable Display System}
On-line pipelines are most useful when results are displayed to the
observer. One complication with pipeline display is that different
observers are interested in different intermediate data products or
wish the final data products to be displayed in a particular
way. Display logic such as this cannot be embedded directly in
\primitives; all a \primitive\ can do is indicate that a particular
product \emph{could} be displayed and leave it to a different system
to decide \emph{whether} the product should be displayed and how to
do so.
The display system uses the \oracdr\ file naming convention to
determine relevance. Usually, the text after the last underscore,
referred to as the file suffix, is used to indicate the reduction step
that generated the file: \texttt{mos} for mosaic, \texttt{dk} for
dark, etc. When a \Frame\ or \Group\ is passed to the display system
the file suffix and, optionally, a \Group\ versus \Frame\ indicator,
are used to form an identifier which is compared with the entries in
the display configuration file. For each row
containing a matching identifier the files will be passed to the
specific display tool. Different plot types are available such as
image, spectrum, histogram, and vector plot and also a specific mode
for plotting a 1-dimensional dataset over a corresponding model. Additional
parameters can be used to control placement within a viewport and how
auto-scaling is handled. The display system currently supports \textsc{gaia}
\citep[][\ascl{1403.024}]{2009ASPC..411..575D} and \textsc{kappa}
\citep[][\ascl{1403.022}]{SUN95} as well as the historical P4 tool
(part of \cgsdr\ \citep{SUN27} and an important influence on the
design).
Originally the display commands would be handled within the \recipe\
execution environment and would block the processing until the display
was complete. This can take a non-negligible amount of time and for the
SCUBA-2 pipeline to meet its performance goals this delay was
unacceptable. The architecture was therefore modified to allow the
display system running from within the \recipe\ to register the
display request but for a separate process to be monitoring these
requests and triggering the display.
\subsection{Support modules}
As well as the systems described above there are general support
modules that provide standardized interfaces for message output, log
file creation and temporary file handling.
The message output layer is required
to allow messages from the external packages and from the \primitives\
to be sent to the right location. This might be a GUI, the terminal or
a log file (or all at once) and supports different messaging levels to
distinguish verbose messages from normal messages and
warnings. Internally this is implemented as a tied object that
emulates the file handle API and contains multiple objects to allow
messages to be sent to multiple locations.
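A stripped-down version of this mechanism, using Perl's \texttt{tie}
interface to fan output out to several destinations, is shown below;
the real message layer adds verbosity filtering and GUI callbacks, and
the names here are illustrative:
{\small
\begin{verbatim}
package MultiOut;                     # minimal fan-out filehandle
sub TIEHANDLE { my ($class, @fhs) = @_; bless [@fhs], $class }
sub PRINT     { my $self = shift; print {$_} @_ for @$self; 1 }
sub PRINTF    { my $self = shift; my $fmt = shift;
                $self->PRINT(sprintf $fmt, @_) }

package main;
open my $log, '>>', 'pipeline.log' or die $!;
tie *MSG, 'MultiOut', \*STDOUT, $log;
print MSG "Processing frame 42\n";    # terminal and log file
\end{verbatim}
}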
Log files are a standard requirement for storing information of
interest to the scientist about the processing such as
quality assurance parameters or photometry results. The pipeline
controls the opening of these files in a standard way so that the
primitive writer simply has to worry about the content.
With the current external applications there are many intermediate files
and most of them are temporary. The allocation of filenames is handled
by the infrastructure and they are cleaned up automatically unless the
pipeline is configured in debugging mode to retain them.
\section{Supporting New Instruments}
An important part of the \oracdr\ philosophy is to make adding new
instruments as painless as possible and re-use as much of the
existing code as possible. The work required obviously depends on the
type of instrument. An infrared array will be straightforward as many
of the \recipes\ will work with only minor adjustments. Adding support
for an X-ray telescope or radio interferometer would require
significantly more work on the recipes.
To add a new instrument the following items must be considered:
\begin{itemize}
\item How are new data presented to the pipeline? \oracdr\ supports a
number of different data detection schemes but cannot cover every option.
\item What is the file format? All the current \recipes\ use Starlink
applications that require NDF \citep{ndfjenness} and if FITS
files are detected the infrastructure converts them to NDF before
handing them to the rest of the system. If the raw data are in HDF5,
or use a very complex data model on top of FITS, new code will have
to be written to support this.
\item How to map the metadata to the internal expectations of the
pipeline? A new module would be needed for \texttt{Astro::FITS::HdrTrans}.
\item Does it need new \recipes/\primitives? This depends on how close
the instrument is to an instrument already supported. The \recipe\
parser can be configured to search in instrument-specific
sub-directories and, for example, the Las Cumbres Observatory
imaging recipes use the standard \primitives\ in many cases but also
provide bespoke versions that handle the idiosyncrasies of their
instrumentation.
\end{itemize}
Once this has been decided new subclasses will have to be written to
encode specialist behavior for \Frame\ and \Group\ objects and the
calibration system, along with the instrument initialization class
that declares the supported calibrations and applications.
\section{Lessons Learned}
\subsection{Language choice can hinder adoption}
In 1998 the best choice of dynamic ``scripting'' language for an astronomy project was
still an open question, with the main contenders being Perl and
Tcl/Tk, and Python a distant third
\citep{1995ComPh...9...57A,1999ASPC..172..494J,1999ASPC..172..483B,2000ASPC..216...91J}.
Tcl/Tk had already been adopted by Starlink
\citep{1995ASPC...77..395T}, STScI \citep{1998SPIE.3349...89D},
SDSS \citep{1996ASPC..101..248S} and ESO \citep{1996ASPC..101..396H,1995ASPC...77...58C} and
would have been the safest choice, but at the time it was felt that
the popularity of Tcl/Tk was peaking. Perl was chosen because it was a
language gaining in popularity, the development team were proficient
in it, and they were also developing the Perl Data Language
\citep[PDL;][]{PDL}, which promised easy handling of array data;
something Tcl/Tk was incapable of.
Over the next decade and a half, beginning with the advent of \texttt{pyraf}
\citep[][\ascl{1207.010}]{2000ASPC..216...59G,2006hstc.conf..437G}
and culminating in Astropy \citep[][\ascl{1304.002}]{2013A&A...558A..33A},
Python became the dominant language for astronomy,
becoming the \emph{lingua franca} for new students in astronomy and
the default scripting interface for new data reduction systems such
as those for ALMA
\citep{2007ASPC..376..127M} and LSST \citep{2010SPIE.7740E..15A}.
In this environment, whilst \oracdr\ received much interest from other
observatories, the use of Perl rather than Python became a
deal-breaker given the skill sets of development groups. During this
period only two additional observatories adopted the pipeline: the
Anglo-Australian Observatory for IRIS2 \citep{2004SPIE.5492..998T} and Las Cumbres
Observatory for their imaging pipeline \citep{2013PASP..125.1031B}.
The core design concepts were not at issue, indeed, Gemini adopted the
key features of the \oracdr\ design in their Gemini Recipe System
\citep{2014ASPC..485..359L}. With approximately 100,000 lines of Perl code in
\oracdr\footnote{For infrastructure and \primitives, but counting code only, with comments adding more than
100,000 lines to that
number. Blank line count not included, nor are support modules from CPAN
required by the pipeline but distributed separately.} it
is impractical to rewrite it all in Python given that the system does
work as designed.
Of course, a language must be chosen without the benefit of hindsight
but it is instructive to see how the best choice for a particular
moment can have significant consequences 15 years later.
\subsection{In-memory versus intermediate files}
When \oracdr\ was being designed the choice was between IRAF
\citep[][\ascl{9911.002}]{2012ASPC..461..595F} and Starlink for the
external packages.
At the time the answer was that Starlink messaging and error reporting were
significantly more robust and allowed the \primitives\ to adjust their
processing based on specific error states (such as there being too few
stars in the field to solve the mosaicking offsets). Additionally,
Starlink supported variance propagation and a structured data format.
From a software
engineering perspective Starlink was clearly the correct choice but it
turned out to be yet another reason why \oracdr\ could not be adopted
by other telescopes. Both these environments relied on each command
reading data from a disk file, processing it in some way and then
writing the results out to either the same or a new file. Many of
these routines were optimized for environments where the science data
was comparable in size to the available RAM and went to great lengths
to read the data in chunks to minimize swapping. It was also not
feasible to rewrite these algorithms (that had been well-tested) in
the Perl Data Language, or even turn the low-level libraries into Perl
function calls, and the penalty involved in continually reading
and writing to the disk was deemed an acceptable trade-off.
As it turns out, the entire debate of Starlink versus IRAF is somewhat
moot in the current funding climate and in an era where many pipeline
environments \citep[e.g.,][]{2010SPIE.7740E..15A} are abandoning
intermediate files and doing all processing in memory for performance
reasons, using, for example, \texttt{numpy} arrays or ``piddles''\footnote{A
``piddle'' is the common term for an array object in the Perl Data Language;
an instance of a \texttt{PDL} object.}. For
instruments where the size of a single observation approaches 1\,TB
\citep[e.g., SWCam at CCAT;][]{2014SPIE9153-21} this presents a
sizable challenge but it seems clear that this is the current trend
and a newly written pipeline infrastructure would assume that all
algorithm work would be in memory.
\subsection{Recipe configuration is needed}
Initially, the intent was for \recipes\ to be edited to suit different
processing needs and for the processing to be entirely driven by the
input data. This was driven strongly by the requirement that the
pipeline should work at the telescope without requiring intervention
from the observer. The original intent of the design was that the
astronomer would select their \recipe\ when they prepared the
observation and that this would be the \recipe\ automatically picked
up by the pipeline when the data were observed. Eventually we realized
that offering more than two or three recipes to choose from in the
Observing Tool (for example, is your object broad line or narrow line,
or are your objects extremely faint point sources or bright extended
structures?) became unwieldy, and most people were not sure how they
wanted to optimize their data processing until they saw what the
initial processing gave them.
After many years of resistance a system was developed in 2009 for
passing \recipe\ parameters from configuration files to the pipeline
and this proved to be immensely popular. It is much simpler for people
to tweak a small set of documented parameters than it is to edit
recipes and it is also much easier to support many project-specific
configuration files than it is to keep track of the differences
between the equivalent number of bespoke recipes. When a processing
job is submitted to the JCMT Science Archive any associated
project-specific configuration file is automatically included, and
these can be updated at any time based on feedback from the data
products. It took far too long to add this functionality and this
delay was partly driven by the overt focus on online functionality
despite the shift to the pipeline being used predominantly in an
offline setting. This is discussed further in the next section.
\subsection{Online design confused offline use}
\label{sec:onvoff}
\oracdr\ was initially designed for on-line summit usage where data
appear incrementally and where as much processing as possible should be done on
each frame whilst waiting for subsequent frames to arrive. As
discussed previously (\secref{sec:exec}), this led to the
execution of the \recipe\ within the main process so that context
could be shared easily.
For off-line mode the environment is very different and you would
ideally wish to first reduce all the calibration observations, then
process all the individual science observations and finally do the
group processing to generate mosaics and co-adds. When doing this the
only context that would need to be passed between different \recipe\
executions would be the calibration information that is already
persistent. Indeed, the \recipes\ themselves could be significantly
simplified in that single observation \recipes\ would not include any
group processing instructions. This is not strictly possible in all
cases. For the ACSIS data reduction \recipes\ \citep{JennessACSISDR}
the output of the frame processing depends on how well the group
co-adding has been done; the more measurements that are included, the
better the baseline subtraction.
As written, the recipes have to handle both on-line and off-line
operation and this is achieved by the group \primitives\ being
configured to be no-ops if they realize that the \Frame\ object that
is currently being processed is not the final member of the group.
Whilst the off-line restrictions can be annoying to someone reducing a
night of data on their home machine, it is possible to deal with the