Tune s3fs cache settings for optimal performance #81

Merged · 11 commits · Jul 3, 2024
43 changes: 33 additions & 10 deletions CHANGELOG.md
@@ -9,13 +9,34 @@ The format is based on [Keep a Changelog], and this project adheres to

### Changed

- [#14](https://github.com/MAAP-Project/gedi-subsetter/issues/14) AWS S3
credentials are no longer obtained via the `maap-py` library. Instead, they
are obtained via a role using the EC2 instance metadata.
- [#72](https://github.com/MAAP-Project/gedi-subsetter/issues/72) Log messages
now use ISO 8601 UTC combined date and time representations with milliseconds.
- [#54](https://github.com/MAAP-Project/gedi-subsetter/issues/54) Granule files
are no longer downloaded. Instead, they are read directly from AWS S3.
- Obtain AWS S3 credentials via a role using the EC2 instance metadata rather
than via the `maap-py` library
([#14](https://github.com/MAAP-Project/gedi-subsetter/issues/14))
- Log messages with timestamps in ISO 8601 UTC combined date and time
representations with milliseconds
([#72](https://github.com/MAAP-Project/gedi-subsetter/issues/72))
- Read granule files directly from AWS S3 instead of downloading them
([#54](https://github.com/MAAP-Project/gedi-subsetter/issues/54))
- Optimize AWS S3 read performance to provide ~10% speed improvement (on
average) over downloading files by tuning the `cache_type`, `block_size`, and
`fill` keyword arguments to the `s3fs.S3FileSystem.open` method
([#77](https://github.com/MAAP-Project/gedi-subsetter/issues/77))
- Set default granule `limit` to 100000. Although this is not unlimited, it
effectively behaves as such because all of the supported GEDI collections have
fewer granules than this limit
([#69](https://github.com/MAAP-Project/gedi-subsetter/issues/69))
- Set default job queue to `maap-dps-worker-32vcpu-64gb` to improve performance
by running on 32 CPUs
([#78](https://github.com/MAAP-Project/gedi-subsetter/issues/78))

### Added

- Add `s3fs_open_kwargs` input to allow the user to specify keyword arguments
to the `s3fs.S3FileSystem.open` method; see [MAAP_USAGE.md] for details
([#77](https://github.com/MAAP-Project/gedi-subsetter/issues/77))
- Add `processes` input to allow the user to specify the number of processes to
use, defaulting to the number of available CPUs
([#77](https://github.com/MAAP-Project/gedi-subsetter/issues/77))

## 0.7.0 (2024-04-23)

@@ -144,8 +165,10 @@ The format is based on [Keep a Changelog], and this project adheres to
[fine-grained error locations in tracebacks]:
https://docs.python.org/3/whatsnew/3.11.html#whatsnew311-pep657
[Keep a Changelog]:
https://keepachangelog.com/en/1.0.0/
[Semantic Versioning]:
https://semver.org/spec/v2.0.0.html
[MAAP-Project/maap-documentation-examples]:
https://github.com/MAAP-Project/maap-documentation-examples
[MAAP_USAGE.md]:
docs/MAAP_USAGE.md
19 changes: 15 additions & 4 deletions algorithm_config.yaml
@@ -4,7 +4,7 @@ algorithm_version: 0.7.0
repository_url: https://github.com/MAAP-Project/gedi-subsetter.git
docker_container_url: mas.maap-project.org/root/maap-workspaces/base_images/vanilla:v3.1.5
disk_space: 20GB
queue: maap-dps-worker-32gb
queue: maap-dps-worker-32vcpu-64gb
build_command: gedi-subsetter/bin/build-dps
run_command: gedi-subsetter/bin/subset.sh
inputs:
@@ -53,15 +53,26 @@ inputs:
required: false
default: all
- name: limit
description: Maximum number of GEDI granule data files to download from the CMR,
regardless of the number of granules within the AOI.
description: Maximum number of GEDI granules to subset, regardless of the number
of granules within the spatio-temporal range.
required: false
default: "1000"
default: "100_000"
- name: output
description: Name of the output file produced by the algorithm. Defaults to
using the AOI file name (without the extension) with the suffix "_subset.gpkg".
required: false
default: ""
- name: s3fs_open_kwargs
description: JSON object representing keyword arguments to pass to
s3fs.S3FileSystem.open when reading files from S3. See
https://filesystem-spec.readthedocs.io/en/latest/api.html#s3fs.S3FileSystem.open.
required: false
default: '{"cache_type": "all", "block_size": 8388608, "fill": true}'
- name: processes
description: Number of processes to use for parallel processing. If not provided,
defaults to the number of available CPUs.
required: false
default: ""
- name: scalene_args
description: Arguments to pass to Scalene for memory and CPU profiling. If not
provided, Scalene will not be used.
26 changes: 19 additions & 7 deletions bin/subset.sh
@@ -27,7 +27,7 @@ else
aoi="$(ls "${input_dir}"/*)"

n_actual=${#}
n_expected=10
n_expected=12

if test ${n_actual} -ne ${n_expected}; then
echo "Expected ${n_expected} inputs, but got ${n_actual}:$(printf " '%b'" "$@")" >&2
@@ -44,22 +44,34 @@ else
[[ -n "${7}" ]] && args+=(--beams "${7}")
[[ -n "${8}" ]] && args+=(--limit "${8}")
[[ -n "${9}" ]] && args+=(--output "${9}")
# Split the 10th argument into an array of arguments to pass to scalene.
IFS=' ' read -ra scalene_args <<<"${10}"
[[ -n "${10}" ]] && args+=(--s3fs-open-kwargs "${10}")
[[ -n "${11}" ]] && args+=(--processes "${11}")
# Split the last argument into an array of arguments to pass to scalene.
IFS=' ' read -ra scalene_args <<<"${12}"

command=("${subset_py}" "${args[@]}")

if [[ ${#scalene_args[@]} -ne 0 ]]; then
ext="html"

for arg in "${scalene_args[@]}"; do
if [[ "${arg}" == "--json" ]]; then
ext="json"
elif [[ "${arg}" == "--cli" ]]; then
ext="txt"
fi
done
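# For example, a hypothetical scalene_args value of "--reduced-profile --json"
# sets ext to "json", so the profile below is written to
# "${output_dir}/profile.json".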

# Force output to be written to the output directory by adding the
# `--outfile` argument after any user-provided arguments. If the user
# provides their own `--outfile` argument, it will be ignored.
# provides their own `--outfile` argument, it will be ignored. Also,
# add `--no-browser` to ensure that scalene does not attempt to open a
# browser.
command=(
scalene
"${scalene_args[@]}"
--column-width 165
--html
--no-browser
--outfile "${output_dir}/profile.html"
--outfile "${output_dir}/profile.${ext}"
---
"${command[@]}"
)
116 changes: 73 additions & 43 deletions docs/MAAP_USAGE.md
@@ -31,7 +31,8 @@ At a high level, the GEDI subsetting algorithm does the following:

## Algorithm Inputs

To run a GEDI subsetting DPS job, you must supply the following inputs:
A GEDI subsetting DPS job takes a few required inputs and several optional
inputs:

- `aoi` (_required_): URL to a GeoJSON file representing your area of interest
(see [Specifying an AOI](#specifying-an-aoi)). This may contain multiple
@@ -41,23 +41,24 @@ To run a GEDI subsetting DPS job, you must supply the following inputs:
include in the output file. These names correspond to the _datasets_ (which
might also be referred to as _variables_ or _layers_ in the DOI documentation)
within the data files, and vary from collection to collection. Consult the
documentation for a list of datasets available per collection (see
[Specifying a DOI](#specifying-a-doi) for documentation links).
documentation for each collection for a list of available datasets (see
[Specifying a DOI](#specifying-a-doi) for documentation links).

In addition to the specified columns, the output file will also include a
`filename` (`str`) column that includes the name of the original `h5` file.

_Changed in version 0.6.0_: The `beam` column is no longer automatically
included. If you wish to include the `beam` column, you must specify it
explicitly in this `columns` value.

**IMPORTANT:** To specify nested datasets (i.e., datasets _not_ at the top of
a BEAM), you may use a path containing forward slashes (`/`) that is relative
to the BEAM it appears within. For example, if a BEAM contains a
`geolocation` group, and within that group is a dataset named
`sensitivity_a2`, then you would refer to that nested dataset as
`geolocation/sensitivity_a2`.

> _Changed in version 0.6.0_: The `beam` column is no longer automatically
> included. If you wish to include the `beam` column, you must specify it
> explicitly in this `columns` value.

- `query` (_optional_; default: no query, select all rows): Query expression for
subsetting the rows in the output file. This expression selects rows of data
for which the expression is true. Again, names in the expression are dataset
@@ -82,43 +84,53 @@ To run a GEDI subsetting DPS job, you must supply the following inputs:
quality_flag == 1 and `geolocation/sensitivity_a2` > 0.95
```
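
  For instance, assuming the expression is evaluated with pandas-style
  `DataFrame.query` semantics (an assumption suggested by the backtick quoting
  above), here is a minimal sketch with hypothetical values for these two
  datasets:

  ```python
  import pandas as pd

  # Hypothetical values for the two dataset columns used above.
  df = pd.DataFrame(
      {
          "quality_flag": [1, 1, 0],
          "geolocation/sensitivity_a2": [0.97, 0.90, 0.99],
      }
  )

  # Backticks allow referencing column names that are not valid Python
  # identifiers, such as nested dataset paths containing slashes.
  subset = df.query("quality_flag == 1 and `geolocation/sensitivity_a2` > 0.95")
  print(subset)  # keeps only the first row
  ```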

- `limit` (_optional_; default: 1_000): Maximum number of GEDI granule data
files to download from the CMR, among those that intersect the specified AOI's
bounding box, and fall within the specified temporal range (if supplied).

_Changed in version 0.6.0_: The default value was reduced from 10000 to 1000.
The AOI for most subsetting operations are likely to incur a request for well
under 1000 granules for downloading, so a larger default value might only lead
to longer CMR query times.
- `limit` (_optional_; default: 100000): Maximum number of GEDI granule data
files to subset, among those that intersect the specified AOI's bounding box,
and fall within the specified temporal range (if supplied). If there are more
granules within the spatio-temporal range, only the first `limit` granules
returned by the corresponding CMR search are used.

> _Changed in version 0.6.0_: The default value was reduced from 10000 to 1000.
> The AOI for most subsetting operations is likely to incur a request for well
> under 1000 granules for downloading, so a larger default value might only
> lead to longer CMR query times.

> _Changed in version 0.8.0_: The default value was increased from 1000 to
> 100000 to avoid confusion in cases where a user does _not_ specify a limit,
> expecting to subset _all_ granules within the specified spatio-temporal
> range, but instead subsetting no more than the default limit of 1000, thus
> obtaining an unexpectedly incomplete result. This new limit should
> effectively behave as if it were unlimited because all supported GEDI
> collections have fewer granules than this default limit.

- `doi` (_required_): [Digital Object Identifier] (DOI) of the GEDI collection
to subset, or a logical name representing such a DOI (see
[Specifying a DOI](#specifying-a-doi))

_New in version 0.3.0_
> _Added in version 0.3.0_

- `lat` (_required_): _Name_ of the dataset used for latitude.
- `lat` (_required_): _Name_ of the dataset used for latitude values.

_New in version 0.3.0_
> _Added in version 0.3.0_

- `lon` (_required_): _Name_ of the dataset used for longitude.
- `lon` (_required_): _Name_ of the dataset used for longitude values.

_New in version 0.3.0_
> _Added in version 0.3.0_

- `beams` (_optional_; default: `all`): Which beams to include in the subset.
If supplied, must be one of logical names `all`, `coverage`, or `power`, _OR_
a comma-separated list of specific beam names, with or without the `BEAM`
prefix (e.g., `BEAM0000,BEAM0001` or `0000,0001`)

_New in version 0.4.0_
> _Added in version 0.4.0_

- `temporal` (_optional_; default: full temporal range available): Temporal
range to subset. You may specify either a closed range, with start and end
dates, or a half-open range, with either a start date or an end date. For
full details on the valid formats, see the NASA CMR's documentation on
[temporal range searches](https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#temporal-range-searches).

_New in version 0.6.0_
> _Added in version 0.6.0_

- `output` (_optional_): Name to use for the output file. This can also include
a path, which will be relative to the standard DPS output directory for a job.
@@ -139,35 +151,51 @@ To run a GEDI subsetting DPS job, you must supply the following inputs:
- `myoutput.h5` -> `myoutput.gpkg`
- `mypath/myoutput` -> `mypath/myoutput.gpkg`

_New in version 0.6.0_
> _Added in version 0.6.0_

- `s3fs_open_kwargs` (_optional_; default:
`'{"cache_type": "all", "block_size": 8388608, "fill": true}'`): JSON string
to pass as keyword arguments to [s3fs.S3FileSystem.open] when reading granule
files from S3. The default value was chosen to provide optimal speed, after
conducting performance profiling for various combinations of possible values,
so it should be unnecessary to supply this input (see the sketch after this
list).

- `scalene_args` (_optional_): Arguments to pass to [Scalene] for performance
profiling. Normal usage should leave this argument blank.
> _Added in version 0.8.0_

Fill this in if you want to collect performance metrics (i.e. CPU and RAM
usage). The recommended value for this input is `--reduced-profile` (see
below for more advanced usage). When used, you will find `profile.html` in
your algorithm output folder.
- `processes` (_optional_; default: number of available CPUs): Number of
processes to use for parallel processing (see the sketch after this list).

> _Added in version 0.8.0_

- `scalene_args` (_optional_; default: none): Arguments to pass to [Scalene] for
performance profiling. Normal usage should leave this argument blank, meaning
that Scalene will _not_ be used.

When this input is supplied, the algorithm will be run via the `scalene`
command, and the value of this input will be passed as arguments to the
command. For a list of the available command-line options, see
command for collecting performance metrics (i.e. CPU and RAM usage), and the
value of this input will be passed as arguments to the command. For a list of
available command-line options, see
<https://github.com/plasma-umass/scalene?tab=readme-ov-file#scalene>.

Starting with `--reduced-profile` produces a relatively brief report that may
aid in more quickly identifying hotspots than a full profile would. However,
to produce a full profile where you want to use all of Scalene's default
values, you must supply _some_ value for this input, so the simplest valid
Scalene option is `--on`. Otherwise, as mentioned above, when no value is
supplied for this input, Scalene will not be used at all.
By default, the name of the profile output file is `profile.html` (placed in
your job's output folder). If you specify the `--json` flag, it will be named
`profile.json`. If you specify the `--cli` flag, it will be named
`profile.txt`.

If you want to use all of Scalene's default values (i.e., not specify any
override values), you cannot leave this input blank; otherwise, Scalene will
not be used at all (as mentioned above). In this case, you must supply _some_
value for this input, and the simplest valid Scalene option is `--on`.

**Note:** Since no browser is available in DPS, when any value is supplied for
this input, the `--no-browser` option will be included to prevent Scalene from
attempting to open a browser.

> **Note:** Since no browser is available in DPS, when any value is
> supplied for this input, the `--no-browser` option will be included to
> prevent Scalene from attempting to open a browser. However, the `--web`
> option will also be included, which will produce HTML output to a file named
> `profile.html`.
> _Added in version 0.7.0_

_New in version 0.7.0_
> _Changed in version 0.8.0_: Specifying the `--json` flag changes the name of
> the profile output file to `profile.json` and specifying `--cli` changes it
> to `profile.txt`.
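
To make the `s3fs_open_kwargs` and `processes` inputs above concrete, here is a
minimal sketch of how they plausibly combine when reading granules directly
from S3. The bucket path, granule name, and helper function are hypothetical,
not the algorithm's actual code; the JSON default is simply forwarded verbatim
as keyword arguments, as the input's description states.

```python
import json
from multiprocessing import Pool, cpu_count

import s3fs

# Mirrors the documented default for the s3fs_open_kwargs input.
S3FS_OPEN_KWARGS = json.loads(
    '{"cache_type": "all", "block_size": 8388608, "fill": true}'
)


def read_granule_header(url: str) -> bytes:
    """Read the first bytes of one granule directly from S3 (illustrative)."""
    # Credentials are discovered via the EC2 instance role (per the changelog).
    fs = s3fs.S3FileSystem()
    # The JSON-decoded input is forwarded verbatim as keyword arguments.
    with fs.open(url, mode="rb", **S3FS_OPEN_KWARGS) as f:
        return f.read(8)  # e.g., the HDF5 file signature


if __name__ == "__main__":
    urls = ["s3://hypothetical-bucket/GEDI02_A_example.h5"]
    # Mirrors the `processes` input's default: the number of available CPUs.
    with Pool(processes=cpu_count()) as pool:
        print(pool.map(read_granule_header, urls))
```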

### Specifying an AOI

@@ -463,3 +491,5 @@ administrative boundaries. PLoS ONE 15(4): e0231866.
https://www.geoboundaries.org/api.html
[Scalene]:
https://github.com/plasma-umass/scalene
[s3fs.S3FileSystem.open]:
https://filesystem-spec.readthedocs.io/en/latest/api.html#s3fs.S3FileSystem.open