Added the Ingest Workflow Developer's Guide
iagaponenko committed Oct 2, 2024
1 parent 989e1d3 commit 41c2236
Showing 10 changed files with 2,501 additions and 55 deletions.
103 changes: 90 additions & 13 deletions doc/ingest/api/index.rst
@@ -1,20 +1,97 @@

.. note::

   Information in this guide corresponds to version **35** of the Qserv REST API. Keep in mind
   that each implementation of the API has a specific version. The version number is incremented
   whenever changes are made to the implementation or the API that might affect users.
   This document will be kept up to date to reflect the latest version of the API.

#####################################
The Ingest Workflow Developer's Guide
#####################################

.. toctree::
:maxdepth: 4

reference/index

Introduction
============

This document describes the API available in Qserv for constructing data ingest applications (also referred to
in this document as *ingest workflows*). The API is designed to provide a high-performance and reliable mechanism
for ingesting large quantities of data. The document is intended as a practical guide for developers building
such applications. It provides a high-level overview of the API, its main components, and the typical workflows
that can be built on top of it.

At a high level, the Qserv Ingest system consists of:

- The REST server integrated into the Master Replication Controller. The server provides a collection
  of services for managing metadata and states of the new catalogs to be ingested. The server also coordinates
  its own operations with Qserv itself and the Qserv Replication System to prevent interference with them
  and to minimize failures during catalog ingest activities.
- The Data Ingest services that run at each Qserv worker alongside the Replication System's worker services.
  The role of these services is to ingest the client's data into the corresponding MySQL tables.
  The services also perform additional (albeit minimal) preprocessing and data transformation (where needed)
  before loading the input data into MySQL. Each worker also runs its own REST server for processing
  the "by reference" ingest requests as well as various metadata requests in the scope of that worker.

Implementation-wise, the Ingest System heavily relies on services and functions of the Replication System,
including the Replication System's Controller Framework, various services (including the Configuration service),
and the worker-side server infrastructure of the Replication System.

Client workflows interact with the system's services via open interfaces (based on the HTTP protocol, REST services,
the JSON data format, etc.) and use ready-to-use tools to fulfill their goal of ingesting catalogs.
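
For illustration, here is a minimal sketch of such an interaction in Python. The controller's address,
the ``/meta/version`` endpoint, and the shape of the response are assumptions made for this example only;
the authoritative definitions of all services can be found in the API Reference:

.. code-block:: python

    import requests

    # Hypothetical address of the Master Replication Controller's REST server.
    REPL_CTRL = "http://qserv-repl-controller:25081"

    # Verify that the workflow is talking to the expected version of the API.
    resp = requests.get(f"{REPL_CTRL}/meta/version")
    resp.raise_for_status()
    version = resp.json().get("version")
    if version != 35:
        raise RuntimeError(f"unsupported Qserv REST API version: {version}")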

Here is a brief summary of the Ingest System's features:

- It introduces well-defined states and semantics into the ingest process. With that, the process of ingesting
  a new catalog has to go through a sequence of specific steps, maintaining a progressive state of the catalog
  within Qserv while it's being ingested. The state transitions and the corresponding enforcements made by
  the system ensure that the catalog is in a consistent state during each step of the process.
  Altogether, this model increases the robustness of the process and also makes it more efficient.

- To facilitate and implement the above-mentioned state transitions, the system introduces a distributed
  checkpointing mechanism named *super-transactions*. These transactions allow for incremental updates of
  the overall state while making it possible to safely roll back to a prior consistent state should any
  problem occur during data loading within such transactions (a minimal sketch of the transaction lifecycle
  is shown after this list).

- At its very foundation, the system is designed for constructing high-performance and parallel ingest
  workflows without compromising the consistency of the ingested catalogs.

- For the actual data loading, the system offers a few options, including pushing data into Qserv directly
  via a proprietary binary protocol, HTTP streaming, or ingesting contributions "by reference" (see the sketch
  after this list). In the latter case, the input data are pulled by the worker services from the remote
  locations specified by the ingest workflows. The presently supported sources include object stores
  (via the HTTP/HTTPS protocols) and locally mounted distributed filesystems (via the POSIX protocol).
  See Ingesting files directly from workers for further details.

- The data loading services also collect various information on the ongoing status of the ingest activities,
  abnormal conditions that may occur while reading, interpreting, or loading the data into Qserv, as well
  as the metadata for the data that is loaded. This information is retained within the persistent
  state of the system for monitoring and debugging purposes. Feedback is provided to the workflows
  on various aspects of the ingest activities. The feedback helps the workflows adjust their
  behavior and ensure the quality of the data being ingested.

- For further information on this subject, see the sections Error reporting (TODO) and Using MySQL warnings (TODO).
  In addition, the API provides REST services for obtaining metadata on the state of catalogs, tables, distributed
  transactions, contribution requests, the progress of the requested operations, etc. (the status-polling sketch
  after this list illustrates one such service).
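
As promised above, here is a minimal sketch of the super-transaction lifecycle in Python. The endpoint
names, request parameters, and response fields are illustrative assumptions, not the authoritative API;
see the API Reference for the actual service definitions:

.. code-block:: python

    import requests

    REPL_CTRL = "http://qserv-repl-controller:25081"  # hypothetical address

    # Begin a super-transaction in the scope of the database being ingested.
    resp = requests.post(f"{REPL_CTRL}/ingest/trans", json={"database": "test_catalog"})
    resp.raise_for_status()
    trans_id = resp.json()["databases"]["test_catalog"]["transactions"][0]["id"]

    try:
        # ... load table contributions into the worker ingest services here,
        #     tagging each contribution with trans_id ...

        # Commit: the loaded data becomes a permanent part of the catalog.
        requests.put(f"{REPL_CTRL}/ingest/trans/{trans_id}?abort=0").raise_for_status()
    except Exception:
        # Roll back to the prior consistent state on any failure.
        requests.put(f"{REPL_CTRL}/ingest/trans/{trans_id}?abort=1").raise_for_status()
        raise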
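
And here is a sketch of a "by reference" contribution submitted to a worker's ingest service, followed by
polling the status of the asynchronous request. The worker address, the endpoint, and the request and
response fields are, again, assumptions made for illustration only:

.. code-block:: python

    import requests

    WORKER = "http://qserv-worker-0:25004"  # hypothetical worker ingest server

    # Ask the worker to pull an input file from a remote object store.
    resp = requests.post(f"{WORKER}/ingest/file-async", json={
        "transaction_id": trans_id,   # the transaction started in the previous sketch
        "table": "Object",
        "chunk": 187,                 # chunk number produced by the partitioner
        "overlap": 0,
        "url": "https://data.example.org/chunk_187.csv",
    })
    resp.raise_for_status()
    contrib_id = resp.json()["contrib"]["id"]

    # Poll the status of the contribution request.
    status = requests.get(f"{WORKER}/ingest/file-async/{contrib_id}").json()
    print(status["contrib"]["status"])  # e.g. IN_PROGRESS, FINISHED, FAILED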

**What the Ingest System won't do**:

- As per its current implementation (which may change in the future), the system will not automatically
  partition the input files. This task is expected to be the responsibility of the ingest workflows.

- It will not (with the very small exception of adding an extra leading column ``qserv_trans_id`` required by
  the implementation of the new system) pre-process the input ``CSV`` files sent to the Ingest Data Servers
  for loading into tables. It's up to the workflows to sanitize the input data and to make it ready to be
  ingested into Qserv (see the sketch below).
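
For example, a workflow might normalize a raw input file into the ``CSV`` format expected by the Ingest
Data Servers along these lines. The raw delimiter, the output dialect, and the ``\N`` convention for NULL
values are assumptions of this sketch; the exact requirements for the input data should be taken from
the API Reference:

.. code-block:: python

    import csv

    with open("raw_input.txt") as src, open("chunk_187.csv", "w", newline="") as dst:
        writer = csv.writer(dst)
        for line in src:
            fields = line.rstrip("\n").split("|")                # hypothetical raw delimiter
            writer.writerow(f.strip() or r"\N" for f in fields)  # \N stands for NULL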


Further documents on the requirements and the low-level technical details of the system's implementation
(which are beyond the scope of this guide) can be found elsewhere.

It's recommended to read this document sequentially. Most ideas presented in the document are introduced
in the section An example of a simple workflow (TODO: add a link to the section). That section is followed
by a few more sections covering advanced topics (TODO: add a link to the section). The API Reference section
at the very end of the document should be used to find complete descriptions of the REST services and tools
mentioned in the document.
10 changes: 10 additions & 0 deletions doc/ingest/api/reference/index.rst
@@ -0,0 +1,10 @@

######################
Ingest API Reference
######################

.. toctree::
:maxdepth: 4

rest/index
tools