
doc/design: platform v2, logical architecture of the query/control layer #23543

Merged
merged 1 commit into MaterializeInc:main from design-decoupled-coordinator on Oct 18, 2024

Conversation

Contributor

@aljoscha aljoscha commented Nov 29, 2023

Rendered

Motivation

Part of MaterializeInc/database-issues#6316

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:

@danhhz danhhz self-requested a review November 29, 2023 17:11
CATALOG state (see Background section, above).

When processing client requests, ADAPTER uses TIMESTAMP ORACLE to get the
latest read timestamp for CATALOG, then fetches a CATALOG snapshot as of at
Contributor

then fetches a CATALOG snapshot as of at least that timestamp

This makes it sound like the catalog's upper has to be advanced continually as time passes, and not just when the catalog changes. Is that right? If so, we might want to address why we're not making the catalog upper lazy in the alternatives section (I'm certainly curious)

Contributor Author

I've been assuming it would advance only when there are changes, and only when there are changes would the write timestamp (at the oracle) be bumped. So most of the time you would get the same read timestamp for the catalog.

Contributor

@danhhz danhhz Dec 4, 2023

Okay cool. The bit I quoted sounds a bit like it contradicts that, so probably worth massaging the language here a bit.
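
A minimal sketch of the behavior described in this thread, assuming a hypothetical `CatalogOracle` type (not the real timestamp oracle API): reads are served at the last write timestamp, and only a catalog change allocates a new one, so repeated reads between changes observe the same timestamp.

```rust
/// Hypothetical catalog timestamp oracle. Reads do not advance anything;
/// only catalog writes bump the write timestamp.
struct CatalogOracle {
    write_ts: u64,
}

impl CatalogOracle {
    /// The latest read timestamp for the catalog: simply the last write.
    fn read_ts(&self) -> u64 {
        self.write_ts
    }

    /// Called only when the catalog actually changes; allocates a strictly
    /// larger write timestamp (and the catalog upper advances with it).
    fn apply_write(&mut self) -> u64 {
        self.write_ts += 1;
        self.write_ts
    }
}

fn main() {
    let mut oracle = CatalogOracle { write_ts: 0 };
    assert_eq!(oracle.read_ts(), oracle.read_ts()); // stable between writes
    let new_ts = oracle.apply_write();
    assert_eq!(oracle.read_ts(), new_ts);
}
```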

Comment on lines 337 to 338
representation of the catalog up to date with changes. However, it is
logically a different operation.
Member

I'm confused about the need for SnapshotAt! It seems to be a strict subset of SubscribeAt, because you could instead call SubscribeAt, wait for the initial snapshot to complete, and then terminate the subscription.

Contributor Author

Yep! I mention right here that it would be implemented in terms of SubscribeAt. I chose to keep it as a separate command (along with a description) here because I refer to "taking a catalog snapshot" below, where I describe how user queries are processed, plus I thought it might be a nice concept to keep in mind in general.

But I'm happy to remove it here if you think it doesn't add anything.
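
A rough sketch of the point being made here, with made-up event and timestamp types rather than the actual catalog API: SnapshotAt can be expressed by driving a SubscribeAt-style stream until the frontier passes the requested time and then dropping the subscription.

```rust
/// Made-up event type for a SubscribeAt-style catalog subscription.
enum CatalogEvent<D> {
    /// An update (data, time, diff) to the catalog collection.
    Update(D, u64, i64),
    /// The subscription's frontier has advanced to this time.
    Progress(u64),
}

/// SnapshotAt expressed in terms of SubscribeAt: collect all updates at times
/// up to and including `as_of`, stop once the frontier has passed `as_of`,
/// then drop the subscription. Consolidation of the updates is omitted.
fn snapshot_at<D>(
    as_of: u64,
    mut next_event: impl FnMut() -> CatalogEvent<D>,
) -> Vec<(D, i64)> {
    let mut snapshot = Vec::new();
    loop {
        match next_event() {
            CatalogEvent::Update(data, time, diff) if time <= as_of => {
                snapshot.push((data, diff));
            }
            // Updates beyond `as_of` are not part of the snapshot.
            CatalogEvent::Update(..) => {}
            // The frontier passed `as_of`: the snapshot is complete.
            CatalogEvent::Progress(frontier) if frontier > as_of => break,
            CatalogEvent::Progress(_) => {}
        }
    }
    snapshot
}
```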

Comment on lines 355 to 364
These changes make it so that the controller knows up to which CATALOG
timestamp it has received commands and to ensure that peeks (or other
query-like operations) are only served once we know that controller state is
_at least_ up to date with the catalog state as of which the peek was
processed. Note that we say the Controller has to be _at least_ up to date with
the `catalog_as_of`, it is okay for its catalog frontier to be beyond that.
This has the effect that in-flight peeks can end up in a situation where a
Controller doesn't have the collection that we are trying to query, and we
therefore have to report back an error. This is in line with current
Materialize behavior.
Member

I have trouble reasoning about the correctness of this. If the controller were required to use the catalog state at exactly as_of, that would make sense to me. As it stands, I think the correctness relies on some fact like collection IDs are immutable, and so the only possible three states for an ID are "collection as yet unknown", "collection exists with known definition", and "collection dropped", and so waiting for the catalog as_of to catch up eliminates the first state from consideration, and we're fine with either of the other states.

But what if collection definitions can change over time? I'm thinking about Parker's forthcoming work on ALTER TABLE. I'm not sure how this presents to the formal model of Materialize. Does ALTER TABLE ... ADD COLUMN in-place mutate the definition of an existing collection? Or does it mint a new collection with a new ID? (Certainly the current plan is to mutate the existing definition of the global ID in place, but we could say that a collection is logically identified by (global_id, revision) and the revision is bumped after an ALTER TABLE.)

Anyway, the point here is that a peek for SELECT * FROM t might violate serializability in a post-ALTER TABLE world if the only constraint on when the peek executes is that the peek's catalog_as_of is less than or equal to the controller's catalog. The peek also needs to be executed before the controller learns of any future alterations to t.

That's all braindump-y. Let me know if anything is unclear and I can take another stab at explaining my thinking.

Contributor Author

It's quite clear! I decided not to write about it here because we would already have those problems when collections are mutated in place: peeks go through stages concurrently with other coordinator work happening. It can, for example, happen that a peek is received, validated, optimized, etc., but the peeked collection is dropped concurrently. Today, we simply return an error when that happens, but in a post-ALTER TABLE world this would lead to more subtle problems, so that work would already have to consider those cases.

I was assuming too many implicit contracts here and should expand, what do you think?

My thinking was roughly:

  • We already have to have a way of dealing with concurrency for these kinds of situations, because peeks are processed concurrently.
  • A peek cannot be allowed to hold up progress; that would require putting holds or locks in place, which seems antithetical to the idea of isolated peeks.
    • -> We need to fail gracefully when state that we expect to be there is not there anymore. This includes cases where a collection is dropped, and with ALTER TABLE that would naturally extend to the revision of a collection that we expect being "dropped".

Contributor Author

This is where we check for collections being removed:

// It is sufficient to check that all the source_ids still exist because we assume:

The set of assumptions mentioned there matches what I had assumed. But it seems we have to be careful and make that more explicit.
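
A sketch of the kind of re-validation referenced above, with placeholder types (`ControllerState` and `PeekError` are not the real names): before serving a peek that was planned against an older catalog snapshot, check that every input collection still exists in controller state and otherwise fail the peek gracefully.

```rust
use std::collections::{BTreeMap, BTreeSet};

type GlobalId = u64;

/// Stand-in for per-collection controller state.
struct CollectionState;

struct ControllerState {
    collections: BTreeMap<GlobalId, CollectionState>,
}

#[derive(Debug)]
enum PeekError {
    CollectionDropped(GlobalId),
}

/// Re-validate a peek that was planned against an older catalog snapshot:
/// every input collection must still exist in controller state, otherwise
/// fail the peek gracefully instead of holding up progress.
fn validate_peek_inputs(
    controller: &ControllerState,
    source_ids: &BTreeSet<GlobalId>,
) -> Result<(), PeekError> {
    for id in source_ids {
        if !controller.collections.contains_key(id) {
            // The collection was dropped (or, with ALTER TABLE, the expected
            // revision was superseded) while the peek was in flight.
            return Err(PeekError::CollectionDropped(*id));
        }
    }
    Ok(())
}
```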

Member

A peek cannot be allowed to hold up progress, it would require putting in place holds or locks, which seems antithetical to the idea of isolated peeks.

Oh this seems like a good thing to shake out! They already do establish a read hold on the data collection. And perhaps establishing a read hold on the catalog collection is not so crazy? Idk! It depends what goes wrong if you keep a hold on the catalog collection. Holding up a drop dataflow command until all outstanding peeks are drained seems potentially fine. Refusing to answer new peeks because an ALTER TABLE is stalled out on an ancient peek seems a lot less fine.

It doesn't have to be all or nothing, either. One of the things we wish we had in the controller protocol is the ability to time out replicas that were down for too long, for some tweakable definition of "too long." Maybe something similar makes sense here, where peeks by default hold back the catalog revision, but the controller has license to abort them if they're taking too long, or as a matter of policy if certain commands come in (e.g., we decide that ALTER TABLE gets to force kill all outstanding peeks).
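
To make the "doesn't have to be all or nothing" idea concrete, a hypothetical policy sketch (none of these names exist in the codebase): peeks hold back the catalog by default, but the controller may abort them after a configurable timeout or when a command arrives that is allowed to force-kill outstanding peeks.

```rust
use std::time::{Duration, Instant};

/// How long an outstanding peek may hold back the catalog frontier.
enum CatalogHoldPolicy {
    /// Wait for the peek to finish, however long it takes.
    Unbounded,
    /// Abort peeks that have held the catalog longer than this.
    AbortAfter(Duration),
}

/// A catalog read hold taken on behalf of a peek.
struct CatalogHold {
    acquired_at: Instant,
}

/// Decide whether to abort the peek and release its hold: either because a
/// command arrived that is allowed to force-kill outstanding peeks (e.g. an
/// ALTER TABLE, as a matter of policy), or because the hold outlived the
/// configured timeout.
fn should_abort(hold: &CatalogHold, policy: &CatalogHoldPolicy, force_kill_pending: bool) -> bool {
    if force_kill_pending {
        return true;
    }
    match policy {
        CatalogHoldPolicy::Unbounded => false,
        CatalogHoldPolicy::AbortAfter(limit) => hold.acquired_at.elapsed() > *limit,
    }
}
```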

Contributor Author

Oh this seems like a good thing to shake out!

agreed! this is subtle, though, so let's discuss in our next sync and then report back here?

Member

Sounds good!

Contributor

But what if collection definitions can change over time?

We discussed this a bit in the scope of ALTER MATERIALIZED VIEW and arrived at the conclusion that having collection definitions immutable in the controller is very much preferable, as it simplifies things enormously. For example, if dataflow definitions can change in-place, you need to switch them out on the replicas. Do you wait for all replicas to have confirmed the switch? What if some replicas are unresponsive? Do you instead continue in a state where some replicas have switched the dataflow but some still send responses for the old one?

I don't think it would be too hard to keep collections immutable in the controller. If we can't re-associate the SQL object with a different GlobalId, we could do what @benesch suggested above and use a revision ID instead.
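
A purely illustrative sketch of the revision idea from this thread: the controller identifies a collection by `(GlobalId, revision)`, so an in-place SQL alteration looks to the controller like dropping one immutable collection and creating another.

```rust
type GlobalId = u64;

/// Logical identity of a collection as seen by the controller. Collections
/// stay immutable in the controller: an in-place SQL alteration bumps
/// `revision`, which the controller treats like dropping the old collection
/// and creating a new one.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash, Debug)]
struct CollectionId {
    global_id: GlobalId,
    revision: u64,
}

impl CollectionId {
    /// The identity minted by e.g. `ALTER TABLE ... ADD COLUMN` under this
    /// scheme; the old (global_id, revision) pair is retired.
    fn next_revision(self) -> CollectionId {
        CollectionId {
            global_id: self.global_id,
            revision: self.revision + 1,
        }
    }
}
```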

DML-style queries) are timestamped with the timestamp (in the catalog timeline)
as of which the query was processed. This ensure that these ephemeral
controller commands are consistent with what the controller must know as of
that timestamp.
Member

One thing I don't see discussed here is the nondeterminism that can be caused by controller commands. Specifically thinking of how the storage controller today is responsible for minting shard IDs. Is this the moment where we want to declare that everything the controller does should be a perfectly deterministic function of its inputs?

Contributor Author

I'm 100% of the opinion that having one durable store that you write to atomically is easier to reason about than having multiple of those (the current situation). And I think this might be required to do 0dt upgrades.

I'm less certain we "really" need it here already: the controller makes the GlobalId -> ShardId mapping definite before it returns from an operation.

This assumes, though, that the mapping (or anything else really) that the controller writes down cannot change. I feel this is similar to what you outlined in #23543 (comment). Then again, though, it seems we also need a revision-like mechanism if we ever want to allow changing the mapping (or any other controller state we keep in shared durable storage).
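
A small sketch of the "make the mapping definite before returning" argument, with placeholder types for the durable store: the ShardId is minted nondeterministically, but the durable, atomic write of the binding is what every later reader observes.

```rust
use std::collections::BTreeMap;

type GlobalId = u64;
type ShardId = u128;

/// Placeholder for a single durable store that is written atomically.
struct DurableStore {
    bindings: BTreeMap<GlobalId, ShardId>,
}

impl DurableStore {
    /// Atomically record the binding unless one already exists, and return
    /// the binding that is now definite.
    fn bind(&mut self, id: GlobalId, candidate: ShardId) -> ShardId {
        *self.bindings.entry(id).or_insert(candidate)
    }
}

/// The shard id is chosen nondeterministically, but the durable write makes
/// the GlobalId -> ShardId mapping definite before the operation returns, so
/// every later reader observes the same mapping.
fn create_collection(store: &mut DurableStore, id: GlobalId, fresh_shard: ShardId) -> ShardId {
    store.bind(id, fresh_shard)
}
```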

and therefore would not survive a restart of `environmentd`.

For the first category, the Coordinator acts as a deterministic translator of
Catalog contents or changes to the Catalog to Controller commands.
Contributor

When you say "deterministic" does that only assume the catalog state as an input? Speaking for Compute, at least the various dataflow timestamps (as_of, until) are not derived from the catalog but from the persist frontiers of dependency (and sometimes dependent) collections.

Contributor Author

I had the catalog in mind, yes, but your comment is valid! Even today, the state of frontiers is not deterministic; it depends on the order in which the coordinator receives upgrades from (storage) clusters and applies since downgrades. So I'd say the approach still works even when controllers are each driven around in their own task (or process, in the long term). What do you think?

Contributor

What I'm unsure about is to what extent adapter needs to be able to select specific as_ofs for dataflows created from catalog state.

  • It definitely needs to select as_ofs for SELECTs and SUBSCRIBEs, but these are ephemeral commands and not derived from the catalog state anyway.
  • It needs to select as_ofs for REFRESH MVs too. I'm not entirely sure, but I think the REFRESH options are stored in the catalog in some form, so we can derive the as_of from those.
  • Apart from that, I think adapter mostly just selects the as_of as the least possible read time across the inputs and doesn't really care about the specific value, so for those cases we could just move as_of selection to the controller.

Something that might be annoying is timestamp selection during bootstrap, as that has to take into account the frontiers of dependencies but also of dependent materialized views (see bootstrap_index_as_of). Moving that to the controller means that the controller can't immediately forward "create dataflow" commands to its replicas, but has to wait until it has seen the entire catalog snapshot, to make sure that as_ofs are adjusted correctly.
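
For the third bullet, a minimal sketch with totally ordered `u64` timestamps standing in for real frontiers (illustrative only): the least possible read time across the inputs is the join, i.e. the maximum, of their since frontiers.

```rust
type Timestamp = u64;

/// The smallest as_of readable by every input: each input is readable only at
/// times at or beyond its since frontier, so the least legal choice is the
/// maximum (join) of those frontiers. During bootstrap one would additionally
/// have to take the frontiers of dependent materialized views into account,
/// as noted above.
fn least_valid_as_of(input_sinces: &[Timestamp]) -> Timestamp {
    input_sinces.iter().copied().max().unwrap_or(Timestamp::MIN)
}
```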

spawned by the `INSERT` are worked off. And the fact that Catalog state (and
other state, including talking to Controller(s)) can only be modified from one
"thread" make it so that the Peek will observe the correct state when it is
being processed.
Contributor

Is it also important for controller responses to have some specific relative priority or does it only matter for internal/external commands?

Contributor Author

I think it's not important, because the order in which we receive them today is already not deterministic.

Contributor

That's also my intuition but I got myself confused by this earlier statement:

As we will see later, an important part of the current design is that these
messages have priorities. The coordinator has to work off all Internal Commands
(1.) before processing Controller Responses (2.), and so on for 2. and 3.
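
A tiny sketch of the priority rule quoted above, with placeholder queues standing in for the coordinator's real channels: internal commands are drained before controller responses, which are drained before client commands.

```rust
use std::collections::VecDeque;

/// Placeholder message kinds for the three priority classes.
enum Message {
    Internal,
    ControllerResponse,
    Client,
}

struct CoordinatorQueues {
    internal: VecDeque<Message>,
    controller: VecDeque<Message>,
    client: VecDeque<Message>,
}

impl CoordinatorQueues {
    /// Strictly prefer higher-priority queues, mirroring the ordering the
    /// design doc describes: 1. internal commands, 2. controller responses,
    /// 3. client commands.
    fn next_message(&mut self) -> Option<Message> {
        self.internal
            .pop_front()
            .or_else(|| self.controller.pop_front())
            .or_else(|| self.client.pop_front())
    }
}
```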

(other than the sequencing of the event loop) of ensuring consistency.
- Deterministic Controller Commands don't _have_ to be delivered by the
Coordinator. Controller(s) could learn about them directly from listening to
Catalog changes.
Contributor

This might be true if we can leave timestamp selection for DDL (creating indexes or MVs) to the controller. The controller has access to all the dependency frontier information, so it would be able to invent as-ofs. I'm just not sure if there is a reason the coordinator needs to be able to control these times.

Alternatively, this might be true if the selected dataflow timestamps were part of the catalog state.


Once the controllers are decoupled from the Coordinator via the CATALOG pTVC,
we can start moving them out into their own processes or move them along-side
the cluster/replica processes. Similarly, we can start moving query processing
to other processes.
Contributor

Is my understanding correct that in this end state we still will have only a single controller per cluster/replica? I.e. we don't need to worry about replicas being able to talk to different controllers at the same time?

Contributor Author

Yes! But I haven't yet fully worked out where that controller will live. The tricky thing is that there can be multiple replicas, and they each can comprise multiple processes.

Couple options:

  • A separate controllerd process? That would be easiest to reason about. But cloud might not like that.
  • It lives inside one specific replica process. Feels icky!
  • It's a distributed thing that lives in all replica processes and internally, somehow, does coordination.

This is one of the last things for Phase III (fully distributed) that I really worry about, because I don't have a good answer yet. 🙈 Good thing we don't need to think about it for Phase II because we still have that one central environmentd process...

Contributor

One thing that's maybe helpful to think about is how badly we need the controllers to be healthy in the Phase III world. If we let the controller run in a replica process, or in all replica processes, it will become slow when the replicas are overloaded, and unavailable when the replicas are OOMing (or the cluster has no replicas?).

@aljoscha aljoscha force-pushed the design-decoupled-coordinator branch from 56fdf3d to 34ab456 on January 26, 2024 12:51
@aljoscha aljoscha changed the title Platform v2: Add design for decoupled and isolated Coordinator doc/design: design doc for decoupled and isolated Coordinator Feb 1, 2024
@aljoscha aljoscha changed the title doc/design: design doc for decoupled and isolated Coordinator doc/design: platform v2, logical architecture of the query/control layer Feb 2, 2024
@aljoscha aljoscha force-pushed the design-decoupled-coordinator branch from f2683be to 858e6b7 on October 18, 2024 11:04
@aljoscha aljoscha marked this pull request as ready for review October 18, 2024 11:04
…l layer

Part of the platform v2 / use-case isolation EPIC.
@aljoscha aljoscha force-pushed the design-decoupled-coordinator branch from 858e6b7 to 3a597fb on October 18, 2024 11:06
@aljoscha aljoscha merged commit a96f43e into MaterializeInc:main Oct 18, 2024
7 checks passed
@aljoscha aljoscha deleted the design-decoupled-coordinator branch October 18, 2024 14:14