-
Notifications
You must be signed in to change notification settings - Fork 1
/
data.qmd
104 lines (75 loc) · 9.49 KB
/
data.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
# Data and metadata management {#data}
Like its predecessor, `drake`, the `targets` package
1. Abstracts files as R objects.
2. Records the real-time progress of targets as they are running.
3. Records special metadata in order to skip targets that are already up to date.
Unlike `drake`, which outsources data and metadata management to an external package, `targets` has an entirely custom internal data system. `targets` goes out of its way to reduce the number of files in storage, centralize the progress data and metadata, assign informative file names, and expose the file system to the user. This approach increases efficiency and portability, and it helps users understand and take control of their data.
## File system
When a `targets` pipeline runs, it creates a folder called `_targets` to store all the files it needs. In `targets` version 0.3.1.9000 and above, users can set the data store path to something other than `_targets` (see [`tar_config_set()`](https://docs.ropensci.org/targets/reference/tar_config_set.html)).
```
_targets/ # Configurable with tar_config_set().
├── meta/
├────── meta
├────── process
├────── progress
├── objects/
├────── target1
├────── target2
├────── branching_target_c7bcb4bd
├────── branching_target_285fb6a9
├────── branching_target_874ca381
├── scratch/ # tar_make() deletes this folder after it finishes.
└── user/ # gittargets users can put custom files here for data version control.
```
The number of files equals the number of targets plus two, which makes projects easier to upload and share among collaborators than `drake`'s `.drake/` cache. (Files in `_targets/scratch/` do not count because they can all be safely deleted after `tar_make()`.) However, these files may still be too large and too numerous for code-specific version control systems like Git. For such projects, it may be more appropriate to share caches through external version-aware platforms such as Dropbox, Microsoft OneDrive, and Google Docs.
The exception to this data structure is content-addressable storage as explained in the documentation of `targets::tar_repository_cas()`.
## Targets
With the exception of dynamic files, the return value of each target lives in its own file inside `_targets/objects/`. The file name is the name of the target, and there is no file extension. The metadata keeps track of the storage format that governs how to read and write the target's data. The default format is RDS, so if target `x` has no explicit format, then `readRDS("_targets/objects/x")` will read the data. (However, we state this just for the sake of understanding. The recommended way to read data is `tar_read()`, which takes the storage format into account.)
## Process
The file `_targets/meta/process` is a pipe-separated flat file recording high-level information about the external `callr` process that orchestrates the targets. In that text file is the process ID, which can be used to check if `tar_make()` is still running in certain situations. Notably, it helps Shiny developers make apps that allow the user to log out and then resume the session after logging back in.
## Progress
The file `_targets/meta/progress` is a pipe-separated flat file with the name of each target and it's current runtime progress (running, built, canceled, or errored). The information in this file helps users keep track of what the pipeline is doing at a given moment. `targets` periodically appends rows to `_targets/meta/progress` as the pipeline progresses, so duplicated names usually appear. For any target with duplicated rows in `_targets/meta/progress`, only the lowest row is valid.
In most situations, the progress file can be safely excluded from version control. Functions like `tar_graph()` use progress information, but it is not essential to the reproducible end product.
## Metadata
`targets` uses special metadata to decide which targets are up to date and which need to run. The metadata file `_targets/meta/meta` is a flat file with one row for every target and every global object relevant to the pipeline. `targets` appends new rows to this file as the pipeline progresses. Unlike `drake`, the metadata is centralized and compatible with `data.table`, which makes it far faster to check which targets are up to date. In addition, the metadata system allows `targets` to check not only for up-to-date targets, but also up-to-date global objects, which makes it easier for the user to understand *why* a target is outdated.
`_targets/meta/meta` has the following columns. Global objects use only the `name`, `type`, and `data` fields.
* `name`: Name of the object or target.
* `type`: Class name of the object or target.
* `data`: Hash of the global object or the file containing the target's return value.
* `command`: Hash of the R command to run the target.
* `depend`: Composite hash of all the target's immediate upstream dependencies.
* `seed`: Random number generator seed of the target. A target seed is unique and deterministically generated from its name.
* `path`: The file path where the return value is stored. For dynamic files, this field could include multiple character strings.
* `time`: Character, hash of the maximum of all the time stamps of the files in `path`.
* `size`: Character, hash of the total file size of all the target's files in `path`.
* `bytes`: Numeric, total file size in bytes of all the target's files in `path`.
* `format`: Name of the storage format of the target. User-specified with `tar_target()` or `tar_option_set()`.
* `iteration`: Iteration mode of the target's value, either `"vector"` or `"list"`. User-specified with `tar_target()` or `tar_option_set()`.
* `parent`: Name of the parent pattern of the target if the target is a branch.
* `children`: For patterns and branching stems, this field has the names of all the branches and buds. Can contain multiple character strings. Empty for branches and non-branching stems.
* `seconds`: Runtime of the target in seconds.
* `warnings`: Warning messages thrown when the target ran.
* `error`: Error message thrown when the target ran.
These fields are pipe-separated in the flat file. Fields `path` and `children` can have multiple character strings, and these character strings are separated by asterisks in storage. (In memory, `path` and `children` are list columns.)
## Skipping up-to-date targets
`targets` uses the metadata to decide if a target is up to date. The `should_run()` method of the `builder` class manages this. A target is outdated if one of the following conditions is met. `targets` checks these rules in the order given below. There is a special `cue` class to allow the user to customize / suppress most of these rules.
1. There is no metadata record of the target.
1. The target errored last run.
1. The target has a different class than it did before.
1. The cue mode equals `"always"`.
1. The cue mode does not equal `"never"`.
1. The `command` metadata field (the hash of the R command) is different from last time.
1. The `depend` metadata field (the hash of the immediate upstream dependency targets and global objects) is different from last time.
1. The storage `format` (user-specified with `tar_target()` or `tar_option_set()`) is different from last time.
1. The storage `repository`` (user-specified with `tar_target()` or `tar_option_set()`) is different from last time.
1. The `iteration` method (user-specified with `tar_target()` or `tar_option_set()`) is different from last time.
1. A target's file (either the one in `_targets/objects/` or a dynamic file) does not exist or changed since last time.
1. The target-specific random number generator seed is different from last time.
A target's dependencies can include functions, and these functions are tracked for changes using a custom hashing procedure. When a function's hash changes, the function is considered invalidated, and so are any downstream targets with the `depend` cue turned on. The `targets` package computes the hash of a function in the following way.
1. Deparse the function with `targets:::safe_deparse()`. This function computes a string representation of the function that removes comments and standardizes whitespace so that trivial changes to formatting do not cue targets to rerun.
1. Manually remove any literal pointers from the function string using `targets:::mask_pointers()`. Such pointers arise from inline compiled C/C++ functions.
1. Compute a hash on the preprocessed string above using `targets:::digest_chr64()`.
Those functions themselves have dependencies, and those dependencies are detected with `codetools::findGlobals()`. Dependencies of functions may include other global functions or global objects. If a dependency of a function is invalidated, the function itself is invalidated, and so are any dependent targets with the `depend` cue turned on.
## Databases
`targets` manages `_targets/meta/progress` and `_targets/meta/meta` with an internal `database` class, which has methods to read, write, and deduplicate entire datasets as well as row-append records for individual targets. To maximize performance, `targets` uses `fread()` and `fwrite()` from `data.table` when working with entire databases and `base::write()` to append individual rows. The `database` class also supports and internal in-memory cache in order to avoid costly interactions with storage.
Internal classes `progress` and `meta` each have a `database` object and methods specific to the use case. And for additional safety, the `record` class encapsulates and validates individual rows of metadata.