-
Notifications
You must be signed in to change notification settings - Fork 1
/
classes.qmd
81 lines (45 loc) · 5.11 KB
/
classes.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
# Major internal classes {#classes}
`targets` relies heavily on object-oriented programming, and the following subsections describe the major classes of objects relevant to orchestration and branching. Objects that appear once per workflow are instances of `R6` classes. To maximize performance, target-specific data structures are lower-tech classes: simple environments with formal constructors, helpers, and validators.
## Algorithm class
An algorithm in `targets` is an abstract class that represents how to iterate through the pipeline target by target. Different algorithms describe different kinds of deployment: for example, execution in the main process on the host machine versus parallel execution on a cluster. Every algorithm has a scheduler object and a pipeline object.
## Pipeline class
A pipeline is a wrapper around a collection of targets, and it is responsible for the initial reasoning about the topology of the pipeline before runtime. Pipelines express their reasoning by producing static graphs and scheduler objects early on. In addition, pipelines contain new buds and branches created dynamically during the pipeline.
## Scheduler class
Whereas pipelines are responsible for *static* topology, schedulers are responsible for *dynamic* topology. Schedulers know
1. The upstream and downstream neighbors of each target.
1. The progress of each target, e.g. queued, running, or finished.
1. How many upstream dependencies need to be checked or built before a target is ready to run.
To meet these responsibilities, the scheduler is composed of three smaller objects:
1. A graph object.
1. A progress object.
1. A priority queue.
## Graph class
The graph class keeps track of the upstream and downstream neighbors of each target. The scheduler adds edges to the graph when new targets are created dynamically.
The graph is implemented as two adjacency lists: one for upstream edges and another for downstream edges. For the purposes of powering a pipeline, we find this low-tech structure to be more efficient than `igraph` in our situation where we repeatedly query the graph and the number of nodes is small. (Transient `igraph` objects, however, are created for validation and visualization purposes.)
## Progress class
The progress class keeps track of the state of each target: queued, running, skipped, built, outdated, canceled, or errored. To accomplish this, the progress object maintains a counter object for each category.
## Queue class
The queue class is a priority queue, essentially a wrapper around a named integer vector of ranks. For `targets`' purposes, the rank of a target is the number of unmet dependencies so far, minus a per-target priority value in the interval `[0, 1)` to control the order in which targets are dequeued. As the pipeline progresses, the queue is checked and modified periodically as dependencies are met. The next target to build is the lowest rank target such that `-1L < rank <= 0L`. Branches are pushed to the queue when they are dynamically created.
## Counter class
A counter is an efficient abstraction for keeping track of target membership in a category. A counter stores the number of targets in the category and a hash table with the names of those targets. Counters are used to efficiently keep track of runtime progress (e.g. running, queued, or built) as well membership in the queue.
## Target class
A target is an abstract class for a step of a pipeline. Each target is a composite of intricate sub-classes that keep track of commands, in-memory dependencies, storage, settings, and some aspects of build behavior. As its name implies, the `targets` package pushes most of its conceptual complexity to the target level in order to decentralize the architecture and make it much easier to reason about the pipeline as a whole.
There are multiple sub-classes of targets, and the different behaviors of different sub-classes drive the orchestration and branching of targets. The inheritance hierarchy is as follows.
* Target
* Bud
* Builder
* Stem
* Branch
* Pattern
## Stem class
A stem is the most basic form of target. It is neither a pattern nor part of a pattern. However, it can dynamically create buds to assist with branching.
## Bud class
A bud is a target that simply contains part of a stem's value in memory. The purpose of a bud is to serve as a dependency of a branch when a pattern branches over a stem.
## Pattern class
A pattern is an abstract class responsible for creating new branches and dynamically updating the scheduler. Stems and patterns are the only targets the user manually defines in the pipeline.
## Branch class
A branch is a target that a pattern creates dynamically at runtime.
## Builder class
Stems and branches have a lot in common: they actually run R commands, and the write the return values to storage. A builder is an abstract class to contain the heavy lifting that stems and branches both do.
## Junction class
A junction is a manifest of the buds or branches (children) that a stem or pattern dynamically creates, and it depends on the user-specified `pattern` argument to `tar_target()`, e.g. `tar_target(..., pattern = cross(x, map(y, z)))`.