Tune s3fs cache settings for optimal performance #81

Merged 11 commits on Jul 3, 2024
24 changes: 24 additions & 0 deletions .gitignore
@@ -134,3 +134,27 @@ dmypy.json

# Pyre type checker
.pyre/

# Created by https://www.toptal.com/developers/gitignore/api/visualstudiocode
# Edit at https://www.toptal.com/developers/gitignore?templates=visualstudiocode

### VisualStudioCode ###
.vscode/*
#!.vscode/settings.json
#!.vscode/tasks.json
#!.vscode/launch.json
#!.vscode/extensions.json
#!.vscode/*.code-snippets

# Local History for Visual Studio Code
.history/

# Built Visual Studio Code Extensions
*.vsix

### VisualStudioCode Patch ###
# Ignore all local history of files
.history
.ionide

# End of https://www.toptal.com/developers/gitignore/api/visualstudiocode
14 changes: 0 additions & 14 deletions .vscode/settings.json

This file was deleted.

44 changes: 34 additions & 10 deletions CHANGELOG.md
@@ -9,13 +9,35 @@ The format is based on [Keep a Changelog], and this project adheres to

### Changed

- [#14](https://github.com/MAAP-Project/gedi-subsetter/issues/14) AWS S3
credentials are no longer obtained via the `maap-py` library. Instead, they
are obtained via a role using the EC2 instance metadata.
- [#72](https://github.com/MAAP-Project/gedi-subsetter/issues/72) Log messages
now use ISO 8601 UTC combined date and time representations with milliseconds.
- [#54](https://github.com/MAAP-Project/gedi-subsetter/issues/54) Granule files
are no longer downloaded. Instead, they are read directly from AWS S3.
- Obtain AWS S3 credentials via a role using the EC2 instance metadata rather
than via the `maap-py` library
([#14](https://github.com/MAAP-Project/gedi-subsetter/issues/14))
- Log messages with timestamps in ISO 8601 UTC combined date and time
representations with milliseconds
([#72](https://github.com/MAAP-Project/gedi-subsetter/issues/72))
- Read granule files directly from AWS S3 instead of downloading them
([#54](https://github.com/MAAP-Project/gedi-subsetter/issues/54))
- Optimize AWS S3 read performance to provide ~10% speed improvement (on
average) over downloading files by tuning the `default_cache_type`,
`default_block_size`, and `default_fill_cache` keyword arguments to the
`fsspec.url_to_fs` function
([#77](https://github.com/MAAP-Project/gedi-subsetter/issues/77))
- Set default granule `limit` to 100000. Although this is not unlimited, it
effectively behaves as such because all of the supported GEDI collections have
fewer granules than this limit.
([#69](https://github.com/MAAP-Project/gedi-subsetter/issues/69))
- Set default job queue to `maap-dps-worker-32vcpu-64gb` to improve performance
by running on 32 CPUs
([#78](https://github.com/MAAP-Project/gedi-subsetter/issues/78))

### Added

- Add `fsspec_kwargs` input to allow user to specify keyword arguments to the
`fsspec.url_to_fs` method; see [MAAP_USAGE.md] for details.
([#77](https://github.com/MAAP-Project/gedi-subsetter/issues/77))
- Add `processes` input to allow user to specify the number of processes to use,
defaulting to the number of available CPUs
([#77](https://github.com/MAAP-Project/gedi-subsetter/issues/77))

## 0.7.0 (2024-04-23)

@@ -144,8 +166,10 @@ The format is based on [Keep a Changelog], and this project adheres to
[fine-grained error locations in tracebacks]:
https://docs.python.org/3/whatsnew/3.11.html#whatsnew311-pep657
[Keep a Changelog]:
https://keepachangelog.com/en/1.0.0/
https://keepachangelog.com/en/1.0.0/
[Semantic Versioning]:
https://semver.org/spec/v2.0.0.html
https://semver.org/spec/v2.0.0.html
[MAAP-Project/maap-documentation-examples]:
https://github.com/MAAP-Project/maap-documentation-examples
https://github.com/MAAP-Project/maap-documentation-examples
[MAAP_USAGE.md]:
docs/MAAP_USAGE.md
19 changes: 15 additions & 4 deletions algorithm_config.yaml
@@ -4,7 +4,7 @@ algorithm_version: 0.7.0
repository_url: https://github.com/MAAP-Project/gedi-subsetter.git
docker_container_url: mas.maap-project.org/root/maap-workspaces/base_images/vanilla:v3.1.5
disk_space: 20GB
queue: maap-dps-worker-32gb
queue: maap-dps-worker-32vcpu-64gb
build_command: gedi-subsetter/bin/build-dps
run_command: gedi-subsetter/bin/subset.sh
inputs:
@@ -53,15 +53,26 @@ inputs:
required: false
default: all
- name: limit
description: Maximum number of GEDI granule data files to download from the CMR,
regardless of the number of granules within the AOI.
description: Maximum number of GEDI granules to subset, regardless of the number
of granules within the spatio-temporal range.
required: false
default: "1000"
default: "100_000"
- name: output
description: Name of the output file produced by the algorithm. Defaults to
using the AOI file name (without the extension) with the suffix "_subset.gpkg".
required: false
default: ""
- name: fsspec_kwargs
description: "JSON object representing keyword arguments to pass to the
fsspec.core.url_to_fs function when reading granule data files. See
https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.core.url_to_fs."
required: false
default: '{"default_cache_type": "all", "default_block_size": 8388608}'
- name: processes
description: Number of processes to use for parallel processing. If not provided,
defaults to the number of available CPUs.
required: false
default: ""
- name: scalene_args
description: Arguments to pass to Scalene for memory and CPU profiling. If not
provided, Scalene will not be used.
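
The new `limit`, `fsspec_kwargs`, and `processes` defaults above are all strings, so they need converting before use. The following is a minimal sketch of that conversion, not code from this PR: the `parse_inputs` helper and its return shape are hypothetical, and `os.cpu_count()` is an assumed stand-in for "the number of available CPUs".

```python
import json
import os


def parse_inputs(limit: str, fsspec_kwargs: str, processes: str) -> dict:
    """Convert string inputs (as given in algorithm_config.yaml) to typed values.

    Hypothetical helper for illustration only; not part of this PR.
    """
    return {
        # int() accepts underscores as digit separators, so "100_000" parses.
        "limit": int(limit),
        # fsspec_kwargs is a JSON object; an empty string means no overrides.
        "fsspec_kwargs": json.loads(fsspec_kwargs) if fsspec_kwargs else {},
        # An empty processes input falls back to the CPU count (at least 1).
        "processes": int(processes) if processes else (os.cpu_count() or 1),
    }


inputs = parse_inputs(
    limit="100_000",
    fsspec_kwargs='{"default_cache_type": "all", "default_block_size": 8388608}',
    processes="",
)
# The default block size is 8 MiB expressed in bytes.
assert inputs["fsspec_kwargs"]["default_block_size"] == 8 * 1024 * 1024
```

Per the `fsspec_kwargs` description, the parsed dictionary would then presumably be unpacked as keyword arguments to `fsspec.core.url_to_fs`.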
26 changes: 19 additions & 7 deletions bin/subset.sh
@@ -27,7 +27,7 @@ else
aoi="$(ls "${input_dir}"/*)"

n_actual=${#}
n_expected=10
n_expected=12

if test ${n_actual} -ne ${n_expected}; then
echo "Expected ${n_expected} inputs, but got ${n_actual}:$(printf " '%b'" "$@")" >&2
@@ -44,22 +44,34 @@ else
[[ -n "${7}" ]] && args+=(--beams "${7}")
[[ -n "${8}" ]] && args+=(--limit "${8}")
[[ -n "${9}" ]] && args+=(--output "${9}")
# Split the 10th argument into an array of arguments to pass to scalene.
IFS=' ' read -ra scalene_args <<<"${10}"
[[ -n "${10}" ]] && args+=(--fsspec-kwargs "${10}")
[[ -n "${11}" ]] && args+=(--processes "${11}")
# Split the last argument into an array of arguments to pass to scalene.
IFS=' ' read -ra scalene_args <<<"${12}"

command=("${subset_py}" "${args[@]}")

if [[ ${#scalene_args[@]} -ne 0 ]]; then
ext="html"

for arg in "${scalene_args[@]}"; do
if [[ "${arg}" == "--json" ]]; then
ext="json"
elif [[ "${arg}" == "--cli" ]]; then
ext="txt"
fi
done

# Force output to be written to the output directory by adding the
# `--outfile` argument after any user-provided arguments. If the user
# provides their own `--outfile` argument, it will be ignored.
# provides their own `--outfile` argument, it will be ignored. Also,
# add `--no-browser` to ensure that scalene does not attempt to open a
# browser.
command=(
scalene
"${scalene_args[@]}"
--column-width 165
--html
--no-browser
--outfile "${output_dir}/profile.html"
--outfile "${output_dir}/profile.${ext}"
---
"${command[@]}"
)
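
For readers who find the shell logic above hard to follow, here is a rough Python restatement of how the profiling command appears to be assembled. It is a sketch under assumptions, not code from this PR: the `scalene_command` function name is hypothetical, and `str.split()` splits on any whitespace, which only approximates `IFS=' ' read -ra`.

```python
def scalene_command(scalene_args: str, output_dir: str, command: list[str]) -> list[str]:
    """Wrap `command` in a scalene invocation when profiling args are given.

    Hypothetical restatement of the shell logic; not part of this PR.
    """
    args = scalene_args.split()
    if not args:
        # No profiling requested: run the original command unchanged.
        return command

    # Pick the profile file extension from the output format flag;
    # if both flags appear, the later one wins, as in the shell loop.
    ext = "html"
    for arg in args:
        if arg == "--json":
            ext = "json"
        elif arg == "--cli":
            ext = "txt"

    # Forced options come after the user's args so they take precedence.
    return [
        "scalene",
        *args,
        "--column-width", "165",
        "--no-browser",
        "--outfile", f"{output_dir}/profile.{ext}",
        "---",
        *command,
    ]


profiled = scalene_command("--json", "out", ["subset.py"])
assert profiled[-4:] == ["--outfile", "out/profile.json", "---", "subset.py"]
```

Placing `--outfile` after the user-supplied arguments mirrors the comment in the script: a user-provided `--outfile` is overridden so the profile always lands in the output directory.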