
doc/design: platform v2, logical architecture of the query/control layer #23543

Merged
merged 1 commit into MaterializeInc:main from design-decoupled-coordinator on Oct 18, 2024

Conversation

Contributor

@aljoscha aljoscha commented Nov 29, 2023

Rendered

Motivation

Part of MaterializeInc/database-issues#6316

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:

@danhhz danhhz self-requested a review November 29, 2023 17:11
CATALOG state (see Background section, above).

When processing client requests, ADAPTER uses TIMESTAMP ORACLE to get the
latest read timestamp for CATALOG, then fetches a CATALOG snapshot as of at
Contributor

then fetches a CATALOG snapshot as of at least that timestamp

This makes it sound like the catalog's upper has to be advanced continually as time passes, and not just when the catalog changes. Is that right? If so, we might want to address why we're not making the catalog upper lazy in the alternatives section (I'm certainly curious)

Contributor Author

I've been assuming it would advance only when there are changes, and only when there are changes would the write timestamp (at the oracle) be bumped. So most of the time you would get the same read timestamp for the catalog.

Contributor

@danhhz danhhz Dec 4, 2023

Okay cool. The bit I quoted sounds a bit like it contradicts that, so probably worth massaging the language here a bit.
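
A minimal sketch of the behavior described in this thread, assuming a hypothetical `CatalogOracle` type (not the real timestamp oracle API): reads are served at the last write timestamp, and only a catalog change allocates a new one, so repeated reads between changes observe the same timestamp.

```rust
/// Hypothetical catalog timestamp oracle. Reads do not advance anything;
/// only catalog writes bump the write timestamp.
struct CatalogOracle {
    write_ts: u64,
}

impl CatalogOracle {
    /// The latest read timestamp for the catalog: simply the last write.
    fn read_ts(&self) -> u64 {
        self.write_ts
    }

    /// Called only when the catalog actually changes; allocates a strictly
    /// larger write timestamp (and the catalog upper advances with it).
    fn apply_write(&mut self) -> u64 {
        self.write_ts += 1;
        self.write_ts
    }
}

fn main() {
    let mut oracle = CatalogOracle { write_ts: 0 };
    assert_eq!(oracle.read_ts(), oracle.read_ts()); // stable between writes
    let new_ts = oracle.apply_write();
    assert_eq!(oracle.read_ts(), new_ts);
}
```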

Comment on lines 337 to 338
representation of the catalog up to date with changes. However, it is
logically a different operation.
Member

I'm confused about the need for SnapshotAt! It seems to be a strict subset of SubscribeAt, because you could instead call SubscribeAt, wait for the initial snapshot to complete, and then terminate the subscription.

Contributor Author

Yep! I mention right here that it would be implemented in terms of SubscribeAt. I chose to keep it as a separate command (along with a description) here because I refer to "taking a catalog snapshot" below, where I describe how user queries are processed, plus I thought it might be a nice concept to keep in mind in general.

But I'm happy to remove it here if you think it doesn't add anything.
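
A rough sketch of the point being made here, with made-up event and timestamp types rather than the actual catalog API: SnapshotAt can be expressed by driving a SubscribeAt-style stream until the frontier passes the requested time and then dropping the subscription.

```rust
/// Made-up event type for a SubscribeAt-style catalog subscription.
enum CatalogEvent<D> {
    /// An update (data, time, diff) to the catalog collection.
    Update(D, u64, i64),
    /// The subscription's frontier has advanced to this time.
    Progress(u64),
}

/// SnapshotAt expressed in terms of SubscribeAt: collect all updates at times
/// up to and including `as_of`, stop once the frontier has passed `as_of`,
/// then drop the subscription. Consolidation of the updates is omitted.
fn snapshot_at<D>(
    as_of: u64,
    mut next_event: impl FnMut() -> CatalogEvent<D>,
) -> Vec<(D, i64)> {
    let mut snapshot = Vec::new();
    loop {
        match next_event() {
            CatalogEvent::Update(data, time, diff) if time <= as_of => {
                snapshot.push((data, diff));
            }
            // Updates beyond `as_of` are not part of the snapshot.
            CatalogEvent::Update(..) => {}
            // The frontier passed `as_of`: the snapshot is complete.
            CatalogEvent::Progress(frontier) if frontier > as_of => break,
            CatalogEvent::Progress(_) => {}
        }
    }
    snapshot
}
```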

Comment on lines 355 to 364
These changes make it so that the controller knows up to which CATALOG
timestamp it has received commands and to ensure that peeks (or other
query-like operations) are only served once we know that controller state is
_at least_ up to date with the catalog state as of which the peek was
processed. Note that we say the Controller has to be _at least_ up to date with
the `catalog_as_of`, it is okay for its catalog frontier to be beyond that.
This has the effect that in-flight peeks can end up in a situation where a
Controller doesn't have the collection that we are trying to query, and we
therefore have to report back an error. This is in line with current
Materialize behavior.
Member

I have trouble reasoning about the correctness of this. If the controller were required to use the catalog state at exactly as_of, that would make sense to me. As it stands, I think the correctness relies on some fact like collection IDs are immutable, and so the only possible three states for an ID are "collection as yet unknown", "collection exists with known definition", and "collection dropped", and so waiting for the catalog as_of to catch up eliminates the first state from consideration, and we're fine with either of the other states.

But what if collection definitions can change over time? I'm thinking about Parker's forthcoming work on ALTER TABLE. I'm not sure how this presents to the formal model of Materialize. Does ALTER TABLE ... ADD COLUMN in-place mutate the definition of an existing collection? Or does it mint a new collection with a new ID? (Certainly the current plan is to mutate the existing definition of the global ID in place, but we could say that a collection is logically identified by (global_id, revision) and the revision is bumped after an ALTER TABLE.)

Anyway, the point here is that a peek for SELECT * FROM t might violate serializability in a post-ALTER TABLE world if the only constraint on when the peek executes is that the peek's catalog_as_of is less than or equal to the controller's catalog. The peek also needs to be executed before the controller learns of any future alterations to t.

That's all braindump-y. Let me know if anything is unclear and I can take another stab at explaining my thinking.

Contributor Author

It's quite clear! I decided not to write about it here because we would already have those problems when collections are mutated in place: peeks go through stages concurrently with other coordinator work happening. It can, for example, happen that a peek is received, validated, optimized, etc., but the peeked collection is dropped concurrently. Today, we simply return an error when that happens, but in a post-ALTER TABLE world this would lead to more subtle problems, so that work would already have to consider those cases.

I was assuming too many implicit contracts here and should expand, what do you think?

My thinking was roughly:

  • We already have to have a way of dealing with concurrency for these kinds of situations, because peeks are processed concurrently.
  • A peek cannot be allowed to hold up progress; that would require putting holds or locks in place, which seems antithetical to the idea of isolated peeks.
    • -> We need to fail gracefully when state that we expect to be there is not there anymore. This includes cases where a collection is dropped, and with ALTER TABLE that would naturally extend to the revision of a collection that we expect being "dropped".

Contributor Author

This is where we check for collections being removed:

// It is sufficient to check that all the source_ids still exist because we assume:

The set of assumptions mentioned there matches what I had assumed. But it seems we have to be careful and make that more explicit.
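
A sketch of the kind of re-validation referenced above, with placeholder types (`ControllerState` and `PeekError` are not the real names): before serving a peek that was planned against an older catalog snapshot, check that every input collection still exists in controller state and otherwise fail the peek gracefully.

```rust
use std::collections::{BTreeMap, BTreeSet};

type GlobalId = u64;

/// Stand-in for per-collection controller state.
struct CollectionState;

struct ControllerState {
    collections: BTreeMap<GlobalId, CollectionState>,
}

#[derive(Debug)]
enum PeekError {
    CollectionDropped(GlobalId),
}

/// Re-validate a peek that was planned against an older catalog snapshot:
/// every input collection must still exist in controller state, otherwise
/// fail the peek gracefully instead of holding up progress.
fn validate_peek_inputs(
    controller: &ControllerState,
    source_ids: &BTreeSet<GlobalId>,
) -> Result<(), PeekError> {
    for id in source_ids {
        if !controller.collections.contains_key(id) {
            // The collection was dropped (or, with ALTER TABLE, the expected
            // revision was superseded) while the peek was in flight.
            return Err(PeekError::CollectionDropped(*id));
        }
    }
    Ok(())
}
```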

Member

A peek cannot be allowed to hold up progress, it would require putting in place holds or locks, which seems antithetical to the idea of isolated peeks.

Oh this seems like a good thing to shake out! They already do establish a read hold on the data collection. And perhaps establishing a read hold on the catalog collection is not so crazy? Idk! It depends what goes wrong if you keep a hold on the catalog collection. Holding up a drop dataflow command until all outstanding peeks are drained seems potentially fine. Refusing to answer new peeks because an ALTER TABLE is stalled out on an ancient peek seems a lot less fine.

It doesn't have to be all or nothing, either. One of the things we wish we had in the controller protocol is the ability to time out replicas that were down for too long, for some tweakable definition of "too long." Maybe something similar makes sense here, where peeks by default hold back the catalog revision, but the controller has license to abort them if they're taking too long, or as a matter of policy if certain commands come in (e.g., we decide that ALTER TABLE gets to force kill all outstanding peeks).
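
To make the "doesn't have to be all or nothing" idea concrete, a hypothetical policy sketch (none of these names exist in the codebase): peeks hold back the catalog by default, but the controller may abort them after a configurable timeout or when a command arrives that is allowed to force-kill outstanding peeks.

```rust
use std::time::{Duration, Instant};

/// How long an outstanding peek may hold back the catalog frontier.
enum CatalogHoldPolicy {
    /// Wait for the peek to finish, however long it takes.
    Unbounded,
    /// Abort peeks that have held the catalog longer than this.
    AbortAfter(Duration),
}

/// A catalog read hold taken on behalf of a peek.
struct CatalogHold {
    acquired_at: Instant,
}

/// Decide whether to abort the peek and release its hold: either because a
/// command arrived that is allowed to force-kill outstanding peeks (e.g. an
/// ALTER TABLE, as a matter of policy), or because the hold outlived the
/// configured timeout.
fn should_abort(hold: &CatalogHold, policy: &CatalogHoldPolicy, force_kill_pending: bool) -> bool {
    if force_kill_pending {
        return true;
    }
    match policy {
        CatalogHoldPolicy::Unbounded => false,
        CatalogHoldPolicy::AbortAfter(limit) => hold.acquired_at.elapsed() > *limit,
    }
}
```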

Contributor Author

Oh this seems like a good thing to shake out!

agreed! this is subtle, though, so let's discuss in our next sync and then report back here?

Member

Sounds good!

Contributor

But what if collection definitions can change over time?

We discussed this a bit in the scope of ALTER MATERIALIZED VIEW and arrived at the conclusion that having collection definitions immutable in the controller is very much preferable, as it simplifies things enormously. For example, if dataflow definitions can change in-place, you need to switch them out on the replicas. Do you wait for all replicas to have confirmed the switch? What if some replicas are unresponsive? Do you instead continue in a state where some replicas have switched the dataflow but some still send responses for the old one?

I don't think it would be too hard to keep collections immutable in the controller. If we can't re-associate the SQL object with a different GlobalId, we could do what @benesch suggested above and use a revision ID instead.
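
A purely illustrative sketch of the revision idea from this thread: the controller identifies a collection by `(GlobalId, revision)`, so an in-place SQL alteration looks to the controller like dropping one immutable collection and creating another.

```rust
type GlobalId = u64;

/// Logical identity of a collection as seen by the controller. Collections
/// stay immutable in the controller: an in-place SQL alteration bumps
/// `revision`, which the controller treats like dropping the old collection
/// and creating a new one.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash, Debug)]
struct CollectionId {
    global_id: GlobalId,
    revision: u64,
}

impl CollectionId {
    /// The identity minted by e.g. `ALTER TABLE ... ADD COLUMN` under this
    /// scheme; the old (global_id, revision) pair is retired.
    fn next_revision(self) -> CollectionId {
        CollectionId {
            global_id: self.global_id,
            revision: self.revision + 1,
        }
    }
}
```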

DML-style queries) are timestamped with the timestamp (in the catalog timeline)
as of which the query was processed. This ensure that these ephemeral
controller commands are consistent with what the controller must know as of
that timestamp.
Member

One thing I don't see discussed here is the nondeterminism that can be caused by controller commands. Specifically thinking of how the storage controller today is responsible for minting shard IDs. Is this the moment where we want to declare that everything the controller does should be a perfectly deterministic function of its inputs?

Contributor Author

I'm 100% of the opinion that having one durable store that you write to atomically is easier to reason about than having multiple of those (the current situation). And I think this might be required to do 0dt upgrades.

I'm less certain we "really" need it here already: the controller makes the GlobalId -> ShardId mapping definite before it returns from an operation.

This assumes, though, that the mapping (or anything else really) that the controller writes down cannot change. I feel this is similar to what you outlined in #23543 (comment). Then again, though, it seems we also need a revision-like mechanism if we ever want to allow changing the mapping (or any other controller state we keep in shared durable storage).
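
A small sketch of the "make the mapping definite before returning" argument, with placeholder types for the durable store: the ShardId is minted nondeterministically, but the durable, atomic write of the binding is what every later reader observes.

```rust
use std::collections::BTreeMap;

type GlobalId = u64;
type ShardId = u128;

/// Placeholder for a single durable store that is written atomically.
struct DurableStore {
    bindings: BTreeMap<GlobalId, ShardId>,
}

impl DurableStore {
    /// Atomically record the binding unless one already exists, and return
    /// the binding that is now definite.
    fn bind(&mut self, id: GlobalId, candidate: ShardId) -> ShardId {
        *self.bindings.entry(id).or_insert(candidate)
    }
}

/// The shard id is chosen nondeterministically, but the durable write makes
/// the GlobalId -> ShardId mapping definite before the operation returns, so
/// every later reader observes the same mapping.
fn create_collection(store: &mut DurableStore, id: GlobalId, fresh_shard: ShardId) -> ShardId {
    store.bind(id, fresh_shard)
}
```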

and therefore would not survive a restart of `environmentd`.

For the first category, the Coordinator acts as a deterministic translator of
Catalog contents or changes to the Catalog to Controller commands.
Contributor

When you say "deterministic" does that only assume the catalog state as an input? Speaking for Compute, at least the various dataflow timestamps (as_of, until) are not derived from the catalog but from the persist frontiers of dependency (and sometimes dependent) collections.

Contributor Author

I had the catalog in mind, yes, but your comment is valid! Even today, the state of frontiers is not deterministic; it depends on the order in which the coordinator receives upgrades from (storage) clusters and applies since downgrades. So I'd say the approach still works even when controllers are each driven around in their own task (or process, in the long term). What do you think?

Contributor

What I'm unsure about is to what extent adapter needs to be able to select specific as_ofs for dataflows created from catalog state.

  • It definitely needs to select as_ofs for SELECTs and SUBSCRIBEs, but these are ephemeral commands and not derived from the catalog state anyway.
  • It needs to select as_ofs for REFRESH MVs too. I'm not entirely sure, but I think the REFRESH options are stored in the catalog in some form, so we can derive the as_of from those.
  • Apart from that, I think adapter mostly just selects the as_of as the least possible read time across the inputs and doesn't really care about the specific value, so for those cases we could just move as_of selection to the controller.

Something that might be annoying is timestamp selection during bootstrap, as that has to take into account the frontiers of dependencies but also of dependent materialized views (see bootstrap_index_as_of). Moving that to the controller means that the controller can't immediately forward "create dataflow" commands to its replicas, but has to wait until it has seen the entire catalog snapshot, to make sure that as_ofs are adjusted correctly.
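
For the third bullet, a minimal sketch with totally ordered `u64` timestamps standing in for real frontiers (illustrative only): the least possible read time across the inputs is the join, i.e. the maximum, of their since frontiers.

```rust
type Timestamp = u64;

/// The smallest as_of readable by every input: each input is readable only at
/// times at or beyond its since frontier, so the least legal choice is the
/// maximum (join) of those frontiers. During bootstrap one would additionally
/// have to take the frontiers of dependent materialized views into account,
/// as noted above.
fn least_valid_as_of(input_sinces: &[Timestamp]) -> Timestamp {
    input_sinces.iter().copied().max().unwrap_or(Timestamp::MIN)
}
```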

spawned by the `INSERT` are worked off. And the fact that Catalog state (and
other state, including talking to Controller(s)) can only be modified from one
"thread" make it so that the Peek will observe the correct state when it is
being processed.
Contributor

Is it also important for controller responses to have some specific relative priority or does it only matter for internal/external commands?

Contributor Author

I think it's not important, because the order in which we receive them today is already not deterministic.

Contributor

That's also my intuition but I got myself confused by this earlier statement:

As we will see later, an important part of the current design is that these
messages have priorities. The coordinator has to work off all Internal Commands
(1.) before processing Controller Responses (2.), and so on for 2. and 3.
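
A tiny sketch of the priority rule quoted above, with placeholder queues standing in for the coordinator's real channels: internal commands are drained before controller responses, which are drained before client commands.

```rust
use std::collections::VecDeque;

/// Placeholder message kinds for the three priority classes.
enum Message {
    Internal,
    ControllerResponse,
    Client,
}

struct CoordinatorQueues {
    internal: VecDeque<Message>,
    controller: VecDeque<Message>,
    client: VecDeque<Message>,
}

impl CoordinatorQueues {
    /// Strictly prefer higher-priority queues, mirroring the ordering the
    /// design doc describes: 1. internal commands, 2. controller responses,
    /// 3. client commands.
    fn next_message(&mut self) -> Option<Message> {
        self.internal
            .pop_front()
            .or_else(|| self.controller.pop_front())
            .or_else(|| self.client.pop_front())
    }
}
```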

(other than the sequencing of the event loop) of ensuring consistency.
- Deterministic Controller Commands don't _have_ to be delivered by the
Coordinator. Controller(s) could learn about them directly from listening to
Catalog changes.
Contributor

This might be true if we can leave timestamp selection for DDL (creating indexes or MVs) to the controller. The controller has access to all the dependency frontier information, so it would be able to invent as-ofs. I'm just not sure if there is a reason the coordinator needs to be able to control these times.

Alternatively, this might be true if the selected dataflow timestamps were part of the catalog state.


Once the controllers are decoupled from the Coordinator via the CATALOG pTVC,
we can start moving them out into their own processes or move them along-side
the cluster/replica processes. Similarly, we can start moving query processing
to other processes.
Contributor

Is my understanding correct that in this end state we still will have only a single controller per cluster/replica? I.e. we don't need to worry about replicas being able to talk to different controllers at the same time?

Contributor Author

Yes! But I haven't yet fully worked out where that controller will live. The tricky thing is that there can be multiple replicas, and they each can comprise multiple processes.

Couple options:

  • A separate controllerd process? That would be easiest to reason about. But cloud might not like that.
  • It lives inside one specific replica process. Feels icky!
  • It's a distributed thing that lives in all replica processes and internally, somehow, does coordination.

This is one of the last things for Phase III (fully distributed) that I really worry about, because I don't have a good answer yet. 🙈 Good thing we don't need to think about it for Phase II because we still have that one central environmentd process...

Contributor

One thing that's maybe helpful to think about is how badly we need the controllers to be healthy in the Phase III world. If we let the controller run in a replica process, or in all replica processes, it will become slow when the replicas are overloaded, and unavailable when the replicas are OOMing (or the cluster has no replicas?).

@aljoscha aljoscha force-pushed the design-decoupled-coordinator branch from 56fdf3d to 34ab456 on January 26, 2024 12:51
@aljoscha aljoscha changed the title Platform v2: Add design for decoupled and isolated Coordinator doc/design: design doc for decoupled and isolated Coordinator Feb 1, 2024
@aljoscha aljoscha changed the title doc/design: design doc for decoupled and isolated Coordinator doc/design: platform v2, logical architecture of the query/control layer Feb 2, 2024
@aljoscha aljoscha force-pushed the design-decoupled-coordinator branch from f2683be to 858e6b7 on October 18, 2024 11:04
@aljoscha aljoscha marked this pull request as ready for review October 18, 2024 11:04
…l layer

Part of the platform v2 / use-case isolation EPIC.
@aljoscha aljoscha force-pushed the design-decoupled-coordinator branch from 858e6b7 to 3a597fb on October 18, 2024 11:06
@aljoscha aljoscha merged commit a96f43e into MaterializeInc:main Oct 18, 2024
7 checks passed
@aljoscha aljoscha deleted the design-decoupled-coordinator branch October 18, 2024 14:14