initial PoC implementation of UDPJobFactory #644

Merged
merged 22 commits into from
Oct 16, 2024
ec982c2
Issue #604 initial PoC implementation of UDPJobFactory
soxofaan Oct 10, 2024
4e04351
Issue #604/#644 refactor out parse_remote_process_definition as stand…
soxofaan Oct 11, 2024
04bc791
Issue #604/#644 explicitly use default value from schema as fallback
soxofaan Oct 11, 2024
ff8b553
Issue #604/#644 add UDPJobFactory+MultiBackendJobManager tests
soxofaan Oct 11, 2024
64cafcf
Issue #645 introduce returning event stats from MultiBackendJobManage…
soxofaan Oct 11, 2024
04296c1
Issue #604/#644 more UDPJobFactory+MultiBackendJobManager tests
soxofaan Oct 11, 2024
f8db877
Issue #604/#644 UDPJobFactory: improve geometry support
soxofaan Oct 11, 2024
ff9c3f2
Issue #604/#644 test coverage for personal UDP mode
soxofaan Oct 11, 2024
632e239
Issue #604/#644 test coverage for geometry handling after resume
soxofaan Oct 11, 2024
5b1e8fa
MultiBackendJobManager: fix another SettingWithCopyWarning related bug
soxofaan Oct 11, 2024
fd9fdb8
Issue #604/#644 test coverage for resuming: also parquet
soxofaan Oct 11, 2024
ade5258
Issue #604/#644 fix title/description
soxofaan Oct 11, 2024
d29d3ee
Issue #604/#644 changelog entry and usage example
soxofaan Oct 14, 2024
a92e47f
Issue #604/#644 add process_id to docs
soxofaan Oct 14, 2024
84ee8ea
Issue #604/#644 UDPJobFactory: make process_id optional (if namespace…
soxofaan Oct 14, 2024
ddc9ee5
Issue #604/#644 replace lru_cache trick with cleaner cache
soxofaan Oct 14, 2024
e22a791
Issue #604/#644 further documentation finetuning
soxofaan Oct 14, 2024
130db87
Issue #604/#644 add tests for parameter_column_map
soxofaan Oct 14, 2024
2667733
Issue #604/#644 add some todo note for furtherdevelopment ideas
soxofaan Oct 14, 2024
1917a73
Issue #604/#644 finetune docs based on review
soxofaan Oct 16, 2024
8ffa2a6
Issue #604/#644 Rename UDPJobFactory to ProcessBasedJobCreator
soxofaan Oct 16, 2024
7bf73de
Issue #604/#644 move ProcessBasedJobCreator example to more extensive…
soxofaan Oct 16, 2024
4 changes: 2 additions & 2 deletions CHANGELOG.md
@@ -13,8 +13,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `MultiBackendJobManager`: Added `initialize_from_df(df)` (to `CsvJobDatabase` and `ParquetJobDatabase`) to initialize (and persist) the job database from a given DataFrame.
Also added `create_job_db()` factory to easily create a job database from a given dataframe and its type guessed from filename extension.
([#635](https://github.com/Open-EO/openeo-python-client/issues/635))


- `MultiBackendJobManager.run_jobs()` now returns a dictionary with counters/stats about various events during the full run of the job manager ([#645](https://github.com/Open-EO/openeo-python-client/issues/645))
- Added (experimental) `ProcessBasedJobCreator` to be used as `start_job` callable with `MultiBackendJobManager` to create multiple jobs from a single parameterized process (e.g. a UDP or remote process definition) ([#604](https://github.com/Open-EO/openeo-python-client/issues/604))

### Changed

106 changes: 106 additions & 0 deletions docs/cookbook/job_manager.rst
@@ -2,6 +2,9 @@
Multi Backend Job Manager
====================================

API
===

.. warning::
    This is a new experimental API, subject to change.

@@ -14,3 +17,106 @@ Multi Backend Job Manager
.. autoclass:: openeo.extra.job_management.CsvJobDatabase

.. autoclass:: openeo.extra.job_management.ParquetJobDatabase


.. autoclass:: openeo.extra.job_management.ProcessBasedJobCreator
    :members:
    :special-members: __call__


.. _job-management-with-process-based-job-creator:

Job creation based on parameterized processes
===============================================

The openEO API supports parameterized processes out of the box,
which makes it possible to work with flexible, reusable openEO building blocks
in the form of :ref:`user-defined processes <user-defined-processes>`
or `remote openEO process definitions <https://github.com/Open-EO/openeo-api/tree/draft/extensions/remote-process-definition>`_.
This can also be leveraged for job creation in the context of the
:py:class:`~openeo.extra.job_management.MultiBackendJobManager`:
define a "template" job as a parameterized process
and let the job manager fill in the parameters
from a given dataframe.

The :py:class:`~openeo.extra.job_management.ProcessBasedJobCreator` helper class
does exactly that.
Given a reference to a parameterized process,
such as a user-defined process or remote process definition,
it can be used directly as the ``start_job`` callable of
:py:meth:`~openeo.extra.job_management.MultiBackendJobManager.run_jobs`,
which will fill in the process parameters from the dataframe.
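
As a rough illustration of what such a parameterized process looks like,
the sketch below builds a minimal process definition as a plain Python dictionary
and lists its parameter names.
The process id ``my_process``, the parameter schemas and the empty process graph
are hypothetical placeholders, not taken from a real service;
only the general layout (``id``, ``parameters``, ``process_graph``)
follows the openEO process definition format.

```python
# Hypothetical, minimal parameterized process definition.
# A real definition would contain an actual process graph
# that references the parameters with {"from_parameter": ...}.
process_definition = {
    "id": "my_process",
    "parameters": [
        {
            "name": "start_date",
            "schema": {"type": "string", "subtype": "date"},
        },
        {
            "name": "bands",
            "schema": {"type": "array", "items": {"type": "string"}},
            "optional": True,
            "default": ["B02", "B03"],
        },
    ],
    "process_graph": {},  # left empty in this sketch
}


def list_parameter_names(definition: dict) -> list:
    """Collect the names of all parameters declared by a process definition."""
    return [param["name"] for param in definition.get("parameters", [])]


parameter_names = list_parameter_names(process_definition)
print(parameter_names)
```

These parameter names are what a job creator would try to match
against the columns of the dataframe.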

Basic :py:class:`~openeo.extra.job_management.ProcessBasedJobCreator` example
-----------------------------------------------------------------------------

Basic usage example with a remote process definition:

.. code-block:: python
    :linenos:
    :caption: Basic :py:class:`~openeo.extra.job_management.ProcessBasedJobCreator` example snippet
    :emphasize-lines: 11-16, 29

    import pandas as pd
    from openeo.extra.job_management import (
        MultiBackendJobManager,
        create_job_db,
        ProcessBasedJobCreator,
    )

    # Job creator, based on a parameterized openEO process
    # (specified by the remote process definition at given URL)
    # which has parameters "start_date" and "bands" for example.
    job_starter = ProcessBasedJobCreator(
        namespace="https://example.com/my_process.json",
        parameter_defaults={
            "bands": ["B02", "B03"],
        },
    )

    # Initialize job database from a dataframe,
    # with desired parameter values to fill in.
    df = pd.DataFrame({
        "start_date": ["2021-01-01", "2021-02-01", "2021-03-01"],
    })
    job_db = create_job_db("jobs.csv").initialize_from_df(df)

    # Create and run job manager,
    # which will start a job for each of the `start_date` values in the dataframe
    # and use the default band list ["B02", "B03"] for the "bands" parameter.
    job_manager = MultiBackendJobManager(...)
    job_manager.run_jobs(job_db=job_db, start_job=job_starter)

In this example, a :py:class:`ProcessBasedJobCreator` is instantiated
based on a remote process definition,
which has parameters ``start_date`` and ``bands``.
When passed to :py:meth:`~openeo.extra.job_management.MultiBackendJobManager.run_jobs`,
a job for each row in the dataframe will be created,
with parameter values based on matching columns in the dataframe:

- the ``start_date`` parameter will be filled in
  with the values from the ``start_date`` column of the dataframe,
- the ``bands`` parameter has no corresponding column in the dataframe,
  and will get its value from the default specified in the ``parameter_defaults`` argument.
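
The lookup order described above can be sketched in plain Python as follows.
This is only an illustration of the idea, not the library's actual implementation:
the function name ``resolve_parameters`` is hypothetical,
and the real class may apply additional fallbacks.

```python
from typing import Any, Dict, List


def resolve_parameters(
    row: Dict[str, Any],
    parameter_names: List[str],
    parameter_defaults: Dict[str, Any],
) -> Dict[str, Any]:
    """Pick a value for each process parameter (hypothetical helper).

    A matching column in the dataframe row takes precedence,
    otherwise the user-provided default is used.
    """
    resolved = {}
    for name in parameter_names:
        if name in row:
            resolved[name] = row[name]
        elif name in parameter_defaults:
            resolved[name] = parameter_defaults[name]
        else:
            raise ValueError(f"No value found for parameter {name!r}")
    return resolved


# One dataframe row, represented here as a simple dictionary.
row = {"start_date": "2021-01-01"}
arguments = resolve_parameters(
    row=row,
    parameter_names=["start_date", "bands"],
    parameter_defaults={"bands": ["B02", "B03"]},
)
print(arguments)
```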


:py:class:`~openeo.extra.job_management.ProcessBasedJobCreator` with geometry handling
---------------------------------------------------------------------------------------------

Apart from the intuitive name-based parameter-column linking,
:py:class:`~openeo.extra.job_management.ProcessBasedJobCreator`
also automatically links:

- a process parameter that accepts inline GeoJSON geometries/features
  (which practically means it has a schema like ``{"type": "object", "subtype": "geojson"}``,
  as produced by :py:meth:`Parameter.geojson <openeo.api.process.Parameter.geojson>`)
- with the geometry column in a `GeoPandas <https://geopandas.org/>`_ dataframe,

even if the name of the parameter does not exactly match
the name of the GeoPandas geometry column (``geometry`` by default).
This automatic linking is only done if there is exactly one
GeoJSON parameter and exactly one geometry column in the dataframe.
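
The linking rule can be sketched without GeoPandas as a small, self-contained function.
This is a simplified illustration under stated assumptions, not the actual implementation:
the helper name ``find_geometry_link`` and the parameter name ``aoi`` are hypothetical,
and geometry columns are represented by their names only.

```python
from typing import Dict, List, Optional, Tuple


def find_geometry_link(
    parameter_schemas: Dict[str, dict],
    geometry_columns: List[str],
) -> Optional[Tuple[str, str]]:
    """Return (parameter name, column name) if the link is unambiguous, else None."""
    geojson_parameters = [
        name
        for name, schema in parameter_schemas.items()
        if schema.get("type") == "object" and schema.get("subtype") == "geojson"
    ]
    # Only link when there is exactly one candidate on each side.
    if len(geojson_parameters) == 1 and len(geometry_columns) == 1:
        return geojson_parameters[0], geometry_columns[0]
    return None


schemas = {
    "aoi": {"type": "object", "subtype": "geojson"},
    "start_date": {"type": "string"},
}
link = find_geometry_link(schemas, geometry_columns=["geometry"])
print(link)
```

Here the ``aoi`` parameter is linked to the ``geometry`` column although their names differ;
with two GeoJSON parameters or no geometry column, no automatic link is made.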


.. admonition:: to do

    Add example with geometry handling.
15 changes: 14 additions & 1 deletion docs/rst-cheatsheet.rst
@@ -50,6 +50,15 @@ More explicit code block with language hint (and no need for double colon)
>>> 3 + 5
8

Code block with additional features (line numbers, caption, highlighted lines,
for more see https://www.sphinx-doc.org/en/master/usage/restructuredtext/directives.html#directive-code-block)

.. code-block:: python
    :linenos:
    :caption: how to say hello
    :emphasize-lines: 1

    print("hello world")


References:
@@ -60,4 +69,8 @@ References:

- refer to the reference with::

:ref:`target`
:ref:`target` or :ref:`custom text <target>`

- inline URL references::

`Python <https://www.python.org/>`_