Added the Ingest Workflow Developer's Guide
iagaponenko committed Oct 2, 2024
1 parent 989e1d3 commit 41c2236
Showing 10 changed files with 2,501 additions and 55 deletions.
103 changes: 90 additions & 13 deletions doc/ingest/api/index.rst
@@ -1,20 +1,97 @@

.. note::

   Information in this guide corresponds to version **35** of the Qserv REST API. Keep in mind
   that each implementation of the API has a specific version. The version number is incremented
   whenever changes are made to the implementation or the API that might affect users.
   This document will be kept up to date to reflect the latest version of the API.

#####################################
The Ingest Workflow Developer's Guide
#####################################

.. toctree::
:maxdepth: 4

reference/index

Introduction
============

This document describes the API available in Qserv for constructing data ingest applications (also referred to
in this document as *ingest workflows*). The API is designed to provide a high-performance and reliable mechanism
for ingesting large quantities of data. The document is intended as a practical guide for developers building
such applications. It provides a high-level overview of the API, its main components, and the typical workflows
that can be built on top of it.

At a high level, the Qserv Ingest system consists of:

- The REST server integrated into the Master Replication Controller. The server provides a collection
  of services for managing metadata and states of the new catalogs to be ingested. The server also coordinates
  its own operations with Qserv itself and the Qserv Replication System to prevent interference with them
  and to minimize failures during catalog ingest activities.
- The Data Ingest services that run at each Qserv worker alongside the Replication System's worker services.
  The role of these services is to ingest the client's data into the corresponding MySQL tables.
  The services also perform additional (albeit minimal) preprocessing and data transformation (where needed)
  before loading the input data into MySQL. Each worker also runs its own REST server for processing
  the "by reference" ingest requests as well as various metadata requests in the scope of that worker.

Implementation-wise, the Ingest System heavily relies on services and functions of the Replication System,
including the Replication System's Controller Framework, various services (including the Configuration service),
and the worker-side server infrastructure of the Replication System.

Client workflows interact with the system's services via open interfaces (based on the HTTP protocol, REST services,
the JSON data format, etc.) and use ready-to-use tools to fulfill their goal of ingesting catalogs.
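
For illustration, here is a minimal sketch of such an interaction in Python. The controller's address,
the ``/meta/version`` endpoint, and the shape of the response are assumptions made for this example only;
the authoritative definitions of all services can be found in the API Reference:

.. code-block:: python

    import requests

    # Hypothetical address of the Master Replication Controller's REST server.
    REPL_CTRL = "http://qserv-repl-controller:25081"

    # Verify that the workflow is talking to the expected version of the API.
    resp = requests.get(f"{REPL_CTRL}/meta/version")
    resp.raise_for_status()
    version = resp.json().get("version")
    if version != 35:
        raise RuntimeError(f"unsupported Qserv REST API version: {version}")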

Here is a brief summary of the Ingest System's features:

- It introduces well-defined states and semantics into the ingest process. With that, the process of ingesting
  a new catalog has to go through a sequence of specific steps, maintaining a progressive state of the catalog
  within Qserv while it's being ingested. The state transitions and the corresponding enforcements made by
  the system ensure that the catalog is in a consistent state during each step of the process.
  Altogether, this model increases the robustness of the process and also makes it more efficient.

- To facilitate and implement the above-mentioned state transitions, the system introduces a distributed
  checkpointing mechanism named *super-transactions*. These transactions allow for incremental updates of
  the overall state while making it possible to safely roll back to a prior consistent state should any
  problem occur during data loading within such transactions (a minimal sketch of the transaction lifecycle
  is shown after this list).

- At its very foundation, the system is designed for constructing high-performance and parallel ingest
  workflows without compromising the consistency of the ingested catalogs.

- For the actual data loading, the system offers a few options, including pushing data into Qserv directly
  via a proprietary binary protocol, HTTP streaming, or ingesting contributions "by reference" (see the sketch
  after this list). In the latter case, the input data are pulled by the worker services from the remote
  locations specified by the ingest workflows. The presently supported sources include object stores
  (via the HTTP/HTTPS protocols) and locally mounted distributed filesystems (via the POSIX protocol).
  See Ingesting files directly from workers for further details.

- The data loading services also collect various information on the ongoing status of the ingest activities,
  abnormal conditions that may occur while reading, interpreting, or loading the data into Qserv, as well
  as the metadata for the data that is loaded. This information is retained within the persistent
  state of the system for monitoring and debugging purposes. Feedback is provided to the workflows
  on various aspects of the ingest activities. The feedback helps the workflows adjust their
  behavior and ensure the quality of the data being ingested.

- For further information on this subject, see the sections Error reporting (TODO) and Using MySQL warnings (TODO).
  In addition, the API provides REST services for obtaining metadata on the state of catalogs, tables, distributed
  transactions, contribution requests, the progress of the requested operations, etc. (the status-polling sketch
  after this list illustrates one such service).
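
As promised above, here is a minimal sketch of the super-transaction lifecycle in Python. The endpoint
names, request parameters, and response fields are illustrative assumptions, not the authoritative API;
see the API Reference for the actual service definitions:

.. code-block:: python

    import requests

    REPL_CTRL = "http://qserv-repl-controller:25081"  # hypothetical address

    # Begin a super-transaction in the scope of the database being ingested.
    resp = requests.post(f"{REPL_CTRL}/ingest/trans", json={"database": "test_catalog"})
    resp.raise_for_status()
    trans_id = resp.json()["databases"]["test_catalog"]["transactions"][0]["id"]

    try:
        # ... load table contributions into the worker ingest services here,
        #     tagging each contribution with trans_id ...

        # Commit: the loaded data becomes a permanent part of the catalog.
        requests.put(f"{REPL_CTRL}/ingest/trans/{trans_id}?abort=0").raise_for_status()
    except Exception:
        # Roll back to the prior consistent state on any failure.
        requests.put(f"{REPL_CTRL}/ingest/trans/{trans_id}?abort=1").raise_for_status()
        raise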
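
And here is a sketch of a "by reference" contribution submitted to a worker's ingest service, followed by
polling the status of the asynchronous request. The worker address, the endpoint, and the request and
response fields are, again, assumptions made for illustration only:

.. code-block:: python

    import requests

    WORKER = "http://qserv-worker-0:25004"  # hypothetical worker ingest server

    # Ask the worker to pull an input file from a remote object store.
    resp = requests.post(f"{WORKER}/ingest/file-async", json={
        "transaction_id": trans_id,   # the transaction started in the previous sketch
        "table": "Object",
        "chunk": 187,                 # chunk number produced by the partitioner
        "overlap": 0,
        "url": "https://data.example.org/chunk_187.csv",
    })
    resp.raise_for_status()
    contrib_id = resp.json()["contrib"]["id"]

    # Poll the status of the contribution request.
    status = requests.get(f"{WORKER}/ingest/file-async/{contrib_id}").json()
    print(status["contrib"]["status"])  # e.g. IN_PROGRESS, FINISHED, FAILED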

**What the Ingest System won't do**:

- As per its current implementation (which may change in the future), the system will not automatically
  partition the input files. This task is expected to be the responsibility of the ingest workflows.

- It will not (with the very small exception of adding an extra leading column ``qserv_trans_id`` required by
  the implementation of the new system) pre-process the input ``CSV`` files sent to the Ingest Data Servers
  for loading into tables. It's up to the workflows to sanitize the input data and to make it ready to be
  ingested into Qserv (see the sketch below).
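
For example, a workflow might normalize a raw input file into the ``CSV`` format expected by the Ingest
Data Servers along these lines. The raw delimiter, the output dialect, and the ``\N`` convention for NULL
values are assumptions of this sketch; the exact requirements for the input data should be taken from
the API Reference:

.. code-block:: python

    import csv

    with open("raw_input.txt") as src, open("chunk_187.csv", "w", newline="") as dst:
        writer = csv.writer(dst)
        for line in src:
            fields = line.rstrip("\n").split("|")                # hypothetical raw delimiter
            writer.writerow(f.strip() or r"\N" for f in fields)  # \N stands for NULL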


Further documents on the requirements and the low-level technical details of the system's implementation
(which are beyond the scope of this guide) can be found elsewhere.

It's recommended to read this document sequentially. Most ideas presented in the document are introduced
in the section An example of a simple workflow (TODO: add a link to the section). That section is followed
by a few more sections covering advanced topics (TODO: add a link to the section). The API Reference section
at the very end of the document should be used to find complete descriptions of the REST services and tools
mentioned in the document.
10 changes: 10 additions & 0 deletions doc/ingest/api/reference/index.rst
@@ -0,0 +1,10 @@

######################
Ingest API Reference
######################

.. toctree::
:maxdepth: 4

rest/index
tools