Tune s3fs cache settings for optimal performance #81

Merged · 11 commits · Jul 3, 2024
43 changes: 33 additions & 10 deletions CHANGELOG.md
@@ -9,13 +9,34 @@ The format is based on [Keep a Changelog], and this project adheres to

### Changed

- [#14](https://github.com/MAAP-Project/gedi-subsetter/issues/14) AWS S3
credentials are no longer obtained via the `maap-py` library. Instead, they
are obtained via a role using the EC2 instance metadata.
- [#72](https://github.com/MAAP-Project/gedi-subsetter/issues/72) Log messages
now use ISO 8601 UTC combined date and time representations with milliseconds.
- [#54](https://github.com/MAAP-Project/gedi-subsetter/issues/54) Granule files
are no longer downloaded. Instead, they are read directly from AWS S3.
- Obtain AWS S3 credentials via a role using the EC2 instance metadata rather
than via the `maap-py` library
([#14](https://github.com/MAAP-Project/gedi-subsetter/issues/14))
- Log messages with timestamps in ISO 8601 UTC combined date and time
representations with milliseconds
([#72](https://github.com/MAAP-Project/gedi-subsetter/issues/72))
- Read granule files directly from AWS S3 instead of downloading them
([#54](https://github.com/MAAP-Project/gedi-subsetter/issues/54))
- Optimize AWS S3 read performance to provide ~10% speed improvement (on
average) over downloading files by tuning the `cache_type`, `block_size`, and
`fill` keyword arguments to the `s3fs.S3FileSystem.open` method
([#77](https://github.com/MAAP-Project/gedi-subsetter/issues/77))
- Set default granule `limit` to 100000. Although this is not unlimited, it
effectively behaves as such because all of the supported GEDI collections have
fewer granules than this limit
([#69](https://github.com/MAAP-Project/gedi-subsetter/issues/69))
- Set default job queue to `maap-dps-worker-32vcpu-64gb` to improve performance
by running on 32 CPUs
([#78](https://github.com/MAAP-Project/gedi-subsetter/issues/78))

### Added

- Add `s3fs_open_kwargs` input to allow the user to specify keyword arguments
to the `s3fs.S3FileSystem.open` method; see [MAAP_USAGE.md] for details
([#77](https://github.com/MAAP-Project/gedi-subsetter/issues/77))
- Add `processes` input to allow the user to specify the number of processes to
use, defaulting to the number of available CPUs
([#77](https://github.com/MAAP-Project/gedi-subsetter/issues/77))

## 0.7.0 (2024-04-23)

@@ -144,8 +165,10 @@ The format is based on [Keep a Changelog], and this project adheres to
[fine-grained error locations in tracebacks]:
https://docs.python.org/3/whatsnew/3.11.html#whatsnew311-pep657
[Keep a Changelog]:
https://keepachangelog.com/en/1.0.0/
[Semantic Versioning]:
https://semver.org/spec/v2.0.0.html
[MAAP-Project/maap-documentation-examples]:
https://github.com/MAAP-Project/maap-documentation-examples
[MAAP_USAGE.md]:
docs/MAAP_USAGE.md
19 changes: 15 additions & 4 deletions algorithm_config.yaml
@@ -4,7 +4,7 @@ algorithm_version: 0.7.0
repository_url: https://github.com/MAAP-Project/gedi-subsetter.git
docker_container_url: mas.maap-project.org/root/maap-workspaces/base_images/vanilla:v3.1.5
disk_space: 20GB
queue: maap-dps-worker-32gb
queue: maap-dps-worker-32vcpu-64gb
build_command: gedi-subsetter/bin/build-dps
run_command: gedi-subsetter/bin/subset.sh
inputs:
@@ -53,15 +53,26 @@ inputs:
required: false
default: all
- name: limit
description: Maximum number of GEDI granule data files to download from the CMR,
regardless of the number of granules within the AOI.
description: Maximum number of GEDI granules to subset, regardless of the number
of granules within the spatio-temporal range.
required: false
default: "1000"
default: "100_000"
- name: output
description: Name of the output file produced by the algorithm. Defaults to
using the AOI file name (without the extension) with the suffix "_subset.gpkg".
required: false
default: ""
- name: s3fs_open_kwargs
description: JSON object representing keyword arguments to pass to
s3fs.S3FileSystem.open when reading files from S3. See
https://filesystem-spec.readthedocs.io/en/latest/api.html#s3fs.S3FileSystem.open.
required: false
default: '{"cache_type": "all", "block_size": 8388608, "fill": true}'
- name: processes
description: Number of processes to use for parallel processing. If not provided,
defaults to the number of available CPUs.
required: false
default: ""
- name: scalene_args
description: Arguments to pass to Scalene for memory and CPU profiling. If not
provided, Scalene will not be used.
26 changes: 19 additions & 7 deletions bin/subset.sh
@@ -27,7 +27,7 @@ else
aoi="$(ls "${input_dir}"/*)"

n_actual=${#}
n_expected=10
n_expected=12

if test ${n_actual} -ne ${n_expected}; then
echo "Expected ${n_expected} inputs, but got ${n_actual}:$(printf " '%b'" "$@")" >&2
@@ -44,22 +44,34 @@ else
[[ -n "${7}" ]] && args+=(--beams "${7}")
[[ -n "${8}" ]] && args+=(--limit "${8}")
[[ -n "${9}" ]] && args+=(--output "${9}")
# Split the 10th argument into an array of arguments to pass to scalene.
IFS=' ' read -ra scalene_args <<<"${10}"
[[ -n "${10}" ]] && args+=(--s3fs-open-kwargs "${10}")
[[ -n "${11}" ]] && args+=(--processes "${11}")
# Split the last argument into an array of arguments to pass to scalene.
IFS=' ' read -ra scalene_args <<<"${12}"

command=("${subset_py}" "${args[@]}")

if [[ ${#scalene_args[@]} -ne 0 ]]; then
ext="html"

for arg in "${scalene_args[@]}"; do
if [[ "${arg}" == "--json" ]]; then
ext="json"
elif [[ "${arg}" == "--cli" ]]; then
ext="txt"
fi
done
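# For example, a hypothetical scalene_args value of "--reduced-profile --json"
# sets ext to "json", so the profile below is written to
# "${output_dir}/profile.json".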

# Force output to be written to the output directory by adding the
# `--outfile` argument after any user-provided arguments. If the user
# provides their own `--outfile` argument, it will be ignored.
# provides their own `--outfile` argument, it will be ignored. Also,
# add `--no-browser` to ensure that scalene does not attempt to open a
# browser.
command=(
scalene
"${scalene_args[@]}"
--column-width 165
--html
--no-browser
--outfile "${output_dir}/profile.html"
--outfile "${output_dir}/profile.${ext}"
---
"${command[@]}"
)
116 changes: 73 additions & 43 deletions docs/MAAP_USAGE.md
@@ -31,7 +31,8 @@ At a high level, the GEDI subsetting algorithm does the following:

## Algorithm Inputs

To run a GEDI subsetting DPS job, you must supply the following inputs:
A GEDI subsetting DPS job takes a few required inputs and several optional
inputs:

- `aoi` (_required_): URL to a GeoJSON file representing your area of interest
(see [Specifying an AOI](#specifying-an-aoi)). This may contain multiple
@@ -41,23 +41,24 @@ To run a GEDI subsetting DPS job, you must supply the following inputs:
include in the output file. These names correspond to the _datasets_ (which
might also be referred to as _variables_ or _layers_ in the DOI documentation)
within the data files, and vary from collection to collection. Consult the
documentation for a list of datasets available per collection (see
[Specifying a DOI](#specifying-a-doi) for documentation links).
documentation for each collection for a list of available datasets (see
[Specifying a DOI](#specifying-a-doi) for documentation links).

In addition to the specified columns, the output file will also include a
`filename` (`str`) column that includes the name of the original `h5` file.

_Changed in version 0.6.0_: The `beam` column is no longer automatically
included. If you wish to include the `beam` column, you must specify it
explicitly in this `columns` value.

**IMPORTANT:** To specify nested datasets (i.e., datasets _not_ at the top of
a BEAM), you may use a path containing forward slashes (`/`) that is relative
to the BEAM it appears within. For example, if a BEAM contains a
`geolocation` group, and within that group is a dataset named
`sensitivity_a2`, then you would refer to that nested dataset as
`geolocation/sensitivity_a2`.

> _Changed in version 0.6.0_: The `beam` column is no longer automatically
> included. If you wish to include the `beam` column, you must specify it
> explicitly in this `columns` value.

- `query` (_optional_; default: no query, select all rows): Query expression for
subsetting the rows in the output file. This expression selects rows of data
for which the expression is true. Again, names in the expression are dataset
@@ -82,43 +84,53 @@ To run a GEDI subsetting DPS job, you must supply the following inputs:
quality_flag == 1 and `geolocation/sensitivity_a2` > 0.95
```
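
  For instance, assuming the expression is evaluated with pandas-style
  `DataFrame.query` semantics (an assumption suggested by the backtick quoting
  above), here is a minimal sketch with hypothetical values for these two
  datasets:

  ```python
  import pandas as pd

  # Hypothetical values for the two dataset columns used above.
  df = pd.DataFrame(
      {
          "quality_flag": [1, 1, 0],
          "geolocation/sensitivity_a2": [0.97, 0.90, 0.99],
      }
  )

  # Backticks allow referencing column names that are not valid Python
  # identifiers, such as nested dataset paths containing slashes.
  subset = df.query("quality_flag == 1 and `geolocation/sensitivity_a2` > 0.95")
  print(subset)  # keeps only the first row
  ```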

- `limit` (_optional_; default: 1_000): Maximum number of GEDI granule data
files to download from the CMR, among those that intersect the specified AOI's
bounding box, and fall within the specified temporal range (if supplied).

_Changed in version 0.6.0_: The default value was reduced from 10000 to 1000.
The AOI for most subsetting operations are likely to incur a request for well
under 1000 granules for downloading, so a larger default value might only lead
to longer CMR query times.
- `limit` (_optional_; default: 100000): Maximum number of GEDI granule data
files to subset, among those that intersect the specified AOI's bounding box,
and fall within the specified temporal range (if supplied). If there are more
granules within the spatio-temporal range, only the first `limit` granules
returned by the corresponding CMR search are used.

> _Changed in version 0.6.0_: The default value was reduced from 10000 to 1000.
> The AOI for most subsetting operations is likely to incur a request for well
> under 1000 granules for downloading, so a larger default value might only
> lead to longer CMR query times.

> _Changed in version 0.8.0_: The default value was increased from 1000 to
> 100000 to avoid confusion in cases where a user does _not_ specify a limit,
> expecting to subset _all_ granules within the specified spatio-temporal
> range, but instead subsetting no more than the default limit of 1000, thus
> obtaining an unexpectedly incomplete result. This new limit should
> effectively behave as if it were unlimited because all supported GEDI
> collections have fewer granules than this default limit.

- `doi` (_required_): [Digital Object Identifier] (DOI) of the GEDI collection
to subset, or a logical name representing such a DOI (see
[Specifying a DOI](#specifying-a-doi))

_New in version 0.3.0_
> _Added in version 0.3.0_

- `lat` (_required_): _Name_ of the dataset used for latitude.
- `lat` (_required_): _Name_ of the dataset used for latitude values.

_New in version 0.3.0_
> _Added in version 0.3.0_

- `lon` (_required_): _Name_ of the dataset used for longitude.
- `lon` (_required_): _Name_ of the dataset used for longitude values.

_New in version 0.3.0_
> _Added in version 0.3.0_

- `beams` (_optional_; default: `all`): Which beams to include in the subset.
If supplied, must be one of logical names `all`, `coverage`, or `power`, _OR_
a comma-separated list of specific beam names, with or without the `BEAM`
prefix (e.g., `BEAM0000,BEAM0001` or `0000,0001`)

_New in version 0.4.0_
> _Added in version 0.4.0_

- `temporal` (_optional_; default: full temporal range available): Temporal
range to subset. You may specify either a closed range, with start and end
dates, or a half-open range, with either a start date or an end date. For
full details on the valid formats, see the NASA CMR's documentation on
[temporal range searches](https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#temporal-range-searches).

_New in version 0.6.0_
> _Added in version 0.6.0_

- `output` (_optional_): Name to use for the output file. This can also include
a path, which will be relative to the standard DPS output directory for a job.
@@ -139,35 +151,51 @@ To run a GEDI subsetting DPS job, you must supply the following inputs:
- `myoutput.h5` -> `myoutput.gpkg`
- `mypath/myoutput` -> `mypath/myoutput.gpkg`

_New in version 0.6.0_
> _Added in version 0.6.0_

- `s3fs_open_kwargs` (_optional_; default:
`'{"cache_type": "all", "block_size": 8388608, "fill": true}'`): JSON string
to pass as keyword arguments to [s3fs.S3FileSystem.open] when reading granule
files from S3. The default value was chosen to provide optimal speed, after
conducting performance profiling for various combinations of possible values,
so it should be unnecessary to supply this input (see the sketch after this
list).

- `scalene_args` (_optional_): Arguments to pass to [Scalene] for performance
profiling. Normal usage should leave this argument blank.
> _Added in version 0.8.0_

Fill this in if you want to collect performance metrics (i.e. CPU and RAM
usage). The recommended value for this input is `--reduced-profile` (see
below for more advanced usage). When used, you will find `profile.html` in
your algorithm output folder.
- `processes` (_optional_; default: number of available CPUs): Number of
processes to use for parallel processing (see the sketch after this list).

> _Added in version 0.8.0_

- `scalene_args` (_optional_; default: none): Arguments to pass to [Scalene] for
performance profiling. Normal usage should leave this argument blank, meaning
that Scalene will _not_ be used.

When this input is supplied, the algorithm will be run via the `scalene`
command, and the value of this input will be passed as arguments to the
command. For a list of the available command-line options, see
command for collecting performance metrics (i.e. CPU and RAM usage), and the
value of this input will be passed as arguments to the command. For a list of
available command-line options, see
<https://github.com/plasma-umass/scalene?tab=readme-ov-file#scalene>.

Starting with `--reduced-profile` produces a relatively brief report that may
aid in more quickly identifying hotspots than a full profile would. However,
to produce a full profile where you want to use all of Scalene's default
values, you must supply _some_ value for this input, so the simplest valid
Scalene option is `--on`. Otherwise, as mentioned above, when no value is
supplied for this input, Scalene will not be used at all.
By default, the name of the profile output file is `profile.html` (placed in
your job's output folder). If you specify the `--json` flag, it will be named
`profile.json`. If you specify the `--cli` flag, it will be named
`profile.txt`.

If you want to use all of Scalene's default values (i.e., not specify any
override values), you cannot leave this input blank; otherwise, Scalene will
not be used at all (as mentioned above). In this case, you must supply _some_
value for this input, and the simplest valid Scalene option is `--on`.

**Note:** Since no browser is available in DPS, when any value is supplied for
this input, the `--no-browser` option will be included to prevent Scalene from
attempting to open a browser.

> **Note:** Since no browser is available in DPS, when any value is
> supplied for this input, the `--no-browser` option will be included to
> prevent Scalene from attempting to open a browser. However, the `--web`
> option will also be included, which will produce HTML output to a file named
> `profile.html`.
> _Added in version 0.7.0_

_New in version 0.7.0_
> _Changed in version 0.8.0_: Specifying the `--json` flag changes the name of
> the profile output file to `profile.json` and specifying `--cli` changes it
> to `profile.txt`.
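
To make the `s3fs_open_kwargs` and `processes` inputs above concrete, here is a
minimal sketch of how they plausibly combine when reading granules directly
from S3. The bucket path, granule name, and helper function are hypothetical,
not the algorithm's actual code; the JSON default is simply forwarded verbatim
as keyword arguments, as the input's description states.

```python
import json
from multiprocessing import Pool, cpu_count

import s3fs

# Mirrors the documented default for the s3fs_open_kwargs input.
S3FS_OPEN_KWARGS = json.loads(
    '{"cache_type": "all", "block_size": 8388608, "fill": true}'
)


def read_granule_header(url: str) -> bytes:
    """Read the first bytes of one granule directly from S3 (illustrative)."""
    # Credentials are discovered via the EC2 instance role (per the changelog).
    fs = s3fs.S3FileSystem()
    # The JSON-decoded input is forwarded verbatim as keyword arguments.
    with fs.open(url, mode="rb", **S3FS_OPEN_KWARGS) as f:
        return f.read(8)  # e.g., the HDF5 file signature


if __name__ == "__main__":
    urls = ["s3://hypothetical-bucket/GEDI02_A_example.h5"]
    # Mirrors the `processes` input's default: the number of available CPUs.
    with Pool(processes=cpu_count()) as pool:
        print(pool.map(read_granule_header, urls))
```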

### Specifying an AOI

@@ -463,3 +491,5 @@ administrative boundaries. PLoS ONE 15(4): e0231866.
https://www.geoboundaries.org/api.html
[Scalene]:
https://github.com/plasma-umass/scalene
[s3fs.S3FileSystem.open]:
https://filesystem-spec.readthedocs.io/en/latest/api.html#s3fs.S3FileSystem.open