doc/design: platform v2, logical architecture of the query/control layer #23543
> CATALOG state (see Background section, above).
>
> When processing client requests, ADAPTER uses TIMESTAMP ORACLE to get the
> latest read timestamp for CATALOG, then fetches a CATALOG snapshot as of at
> least that timestamp.
> then fetches a CATALOG snapshot as of at least that timestamp

This makes it sound like the catalog's upper has to be advanced continually as time passes, and not just when the catalog changes. Is that right? If so, we might want to address why we're not making the catalog upper lazy in the alternatives section (I'm certainly curious).
I've been assuming it would advance only when there are changes, and only when there are changes would the write timestamp (at the oracle) be bumped. So most of the time you would get the same read timestamp for the catalog.
Okay cool. This bit I quoted sounds a bit like it contradicts that, so it's probably worth massaging the language here a bit.
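For readers skimming this thread, a minimal sketch of the flow being discussed, under the resolution above (the catalog upper only advances on changes). All names here (`TimestampOracle`, `CatalogStore`, `snapshot_at_least`) are invented for illustration; the real interfaces may differ:

```rust
// Hypothetical interfaces, for illustration only.
trait TimestampOracle {
    /// Latest read timestamp in the catalog timeline. Per the discussion
    /// above, this is only bumped when the catalog actually changes, so
    /// repeated reads usually return the same value.
    fn read_ts(&self) -> u64;
}

struct CatalogSnapshot; // contents elided

trait CatalogStore {
    /// A consistent snapshot of the catalog as of (at least) `ts`.
    fn snapshot_at_least(&self, ts: u64) -> CatalogSnapshot;
}

fn handle_client_request(oracle: &dyn TimestampOracle, catalog: &dyn CatalogStore) {
    // 1. Ask the oracle for the latest catalog read timestamp.
    let ts = oracle.read_ts();
    // 2. Fetch a catalog snapshot as of at least that timestamp.
    let _snapshot = catalog.snapshot_at_least(ts);
    // 3. Plan and sequence the request against that snapshot (elided).
}
```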
> representation of the catalog up to date with changes. However, it is
> logically a different operation.
I'm confused about the need for `SnapshotAt`! It seems to be a strict subset of `SubscribeAt`, because you could instead call `SubscribeAt`, wait for the initial snapshot to complete, and then terminate the subscription.
Yep! I mention right here that it would be implemented in terms of `SubscribeAt`. I chose to keep it as a separate command (along with a description) here because I refer to "taking a catalog snapshot" below, where I describe how user queries are processed, plus I thought it might be a nice concept to keep in mind in general.

But I'm happy to remove it here if you think it doesn't add anything.
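A rough illustration of that relationship, with invented Rust signatures (the design doc only names the commands, not their shapes):

```rust
// Hypothetical catalog-listen API; the command names follow the design doc's
// SubscribeAt/SnapshotAt, but the Rust signatures are invented.
struct CatalogUpdate; // contents elided

enum CatalogEvent {
    /// The initial snapshot as of the subscribed timestamp is complete.
    SnapshotComplete(Vec<CatalogUpdate>),
    /// A subsequent incremental change.
    Change(CatalogUpdate),
}

trait CatalogSubscribe {
    /// SubscribeAt: a stream of catalog events, starting at `ts`.
    fn subscribe_at(&self, ts: u64) -> Box<dyn Iterator<Item = CatalogEvent>>;
}

/// SnapshotAt, expressed in terms of SubscribeAt: wait for the initial
/// snapshot to complete, then drop the subscription.
fn snapshot_at(catalog: &dyn CatalogSubscribe, ts: u64) -> Vec<CatalogUpdate> {
    for event in catalog.subscribe_at(ts) {
        if let CatalogEvent::SnapshotComplete(snapshot) = event {
            return snapshot; // returning drops the iterator, i.e. the subscription
        }
    }
    Vec::new() // subscription ended without a snapshot; treat as empty
}
```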
> These changes make it so that the controller knows up to which CATALOG
> timestamp it has received commands and to ensure that peeks (or other
> query-like operations) are only served once we know that controller state is
> _at least_ up to date with the catalog state as of which the peek was
> processed. Note that we say the Controller has to be _at least_ up to date with
> the `catalog_as_of`, it is okay for its catalog frontier to be beyond that.
> This has the effect that in-flight peeks can end up in a situation where a
> Controller doesn't have the collection that we are trying to query, and we
> therefore have to report back an error. This is in line with current
> Materialize behavior.
I have trouble reasoning about the correctness of this. If the controller were required to use the catalog state at exactly `as_of`, that would make sense to me. As it stands, I think the correctness relies on some fact like: collection IDs are immutable, and so the only possible three states for an ID are "collection as yet unknown", "collection exists with known definition", and "collection dropped", and so waiting for the catalog `as_of` to catch up eliminates the first state from consideration, and we're fine with either of the other states.

But what if collection definitions can change over time? I'm thinking about Parker's forthcoming work on `ALTER TABLE`. I'm not sure how this presents to the formal model of Materialize. Does `ALTER TABLE ... ADD COLUMN` in-place mutate the definition of an existing collection? Or does it mint a new collection with a new ID? (Certainly the current plans are to in-place mutate the existing definition of the global ID, but we could say that a collection is logically identified by `(global_id, revision)` and the revision is bumped after an `ALTER TABLE`.)

Anyway, the point here is that a peek for `SELECT * FROM t` might violate serializability in a post-`ALTER TABLE` world if the only constraint on when the peek executes is that the peek's `catalog_as_of` is less than or equal to the controller's catalog. The peek also needs to be executed before the controller learns of any future alterations to `t`.

That's all braindump-y. Let me know if anything is unclear and I can take another stab at explaining my thinking.
It's quite clear! I decided not to write about it here because we would already have those problems when collections are mutated in place: peeks go through stages concurrently with other coordinator work happening. It can, for example, happen that a peek is received, validated, optimized, etc., but the peeked collection is dropped concurrently. Today, we simply return an error when that happens, but in a post-`ALTER TABLE` world this would lead to more subtle problems, so that work would already have to consider those cases.

I was assuming too many implicit contracts here and should expand. What do you think?

My thinking was roughly:

- We already have to have a way of dealing with concurrency for these kinds of situations, because peeks are processed concurrently.
- A peek cannot be allowed to hold up progress; it would require putting in place holds or locks, which seems antithetical to the idea of isolated peeks.
- -> We need to fail gracefully when state that we expect to be there is not there anymore. This includes cases where a collection is dropped, and with `ALTER TABLE` that would naturally extend to the revision of a collection that we expect being "dropped" (a minimal sketch of such a check follows below).
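To make the "fail gracefully" bullet concrete, here is a hypothetical sketch of the kind of revalidation meant above. The types and the revision field are invented; today's check is existence-only (see the `coord.rs` comment referenced in a later comment):

```rust
use std::collections::BTreeMap;

// Hypothetical types, for illustration only.
#[derive(PartialEq)]
struct Revision(u64);
struct GlobalId(u64);
struct PeekError(String);

/// Before finishing a peek, re-check that every depended-on collection still
/// exists and, in a post-ALTER TABLE world, is still at the revision the peek
/// was planned against. If not, report an error instead of blocking progress.
fn validate_peek_deps(
    catalog: &BTreeMap<u64, Revision>, // id -> current revision
    expected: &[(GlobalId, Revision)], // what the peek was planned against
) -> Result<(), PeekError> {
    for (GlobalId(id), revision) in expected {
        match catalog.get(id) {
            Some(current) if current == revision => continue,
            Some(_) => return Err(PeekError(format!("collection {id} was altered"))),
            None => return Err(PeekError(format!("collection {id} was dropped"))),
        }
    }
    Ok(())
}
```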
This is where we check for collections being removed, in `materialize/src/adapter/src/coord.rs`, line 474 (at 0b7b848):

> // It is sufficient to check that all the source_ids still exist because we assume:

The set of assumptions mentioned there matches what I had assumed. But it seems we have to be careful and make that more explicit.
> A peek cannot be allowed to hold up progress; it would require putting in place holds or locks, which seems antithetical to the idea of isolated peeks.

Oh this seems like a good thing to shake out! They already do establish a read hold on the data collection. And perhaps establishing a read hold on the catalog collection is not so crazy? Idk! It depends what goes wrong if you keep a hold on the catalog collection. Holding up a drop dataflow command until all outstanding peeks are drained seems potentially fine. Refusing to answer new peeks because an `ALTER TABLE` is stalled out on an ancient peek seems a lot less fine.

It doesn't have to be all or nothing, either. One of the things we wish we had in the controller protocol is the ability to time out replicas that were down for too long, for some tweakable definition of "too long." Maybe something similar makes sense here, where peeks by default hold back the catalog revision, but the controller has license to abort them if they're taking too long, or as a matter of policy if certain commands come in (e.g., we decide that `ALTER TABLE` gets to force kill all outstanding peeks).
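A toy sketch of that "not all or nothing" idea: peeks hold back the catalog frontier by default, but the controller may abort ones that overstay a configurable deadline. Everything here (types, fields, the plain `u64` timestamps) is invented for illustration, not a proposed implementation:

```rust
use std::time::{Duration, Instant};

// Hypothetical bookkeeping for outstanding peeks.
struct OutstandingPeek {
    catalog_hold: u64, // catalog timestamp this peek holds back
    started: Instant,
}

struct PeekPolicy {
    max_age: Duration, // tweakable definition of "too long"
}

impl PeekPolicy {
    /// Abort (here: simply drop) peeks that have been outstanding too long,
    /// then return the catalog frontier the remaining peeks still hold back.
    fn release_holds(&self, peeks: &mut Vec<OutstandingPeek>, now: Instant) -> Option<u64> {
        peeks.retain(|p| now.duration_since(p.started) <= self.max_age);
        peeks.iter().map(|p| p.catalog_hold).min()
    }
}
```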
> Oh this seems like a good thing to shake out!

Agreed! This is subtle, though, so let's discuss in our next sync and then report back here?
Sounds good!
> But what if collection definitions can change over time?

We discussed this a bit in the scope of `ALTER MATERIALIZED VIEW` and arrived at the conclusion that having collection definitions immutable in the controller is very much preferable, as it simplifies things enormously. For example, if dataflow definitions can change in-place, you need to switch them out on the replicas. Do you wait for all replicas to have confirmed the switch? What if some replicas are unresponsive? Do you instead continue in a state where some replicas have switched the dataflow but some still send responses for the old one?

I don't think it would be too hard to keep collections immutable in the controller. If we can't re-associate the SQL object with a different `GlobalId`, we could do what @benesch suggested above and use a revision ID instead.
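A sketch of what that revision idea could look like on the controller side. This is purely a hypothetical shape, not the actual plan:

```rust
// Hypothetical: the controller identifies a collection by (GlobalId, revision),
// so an ALTER mints a "new" collection from the controller's point of view
// instead of mutating an existing one in place.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct GlobalId(u64);

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct CollectionId {
    global_id: GlobalId,
    revision: u64,
}

impl CollectionId {
    /// ALTER TABLE ... ADD COLUMN would bump the revision rather than mutate
    /// the existing definition, keeping controller-side collections immutable.
    fn altered(self) -> CollectionId {
        CollectionId { revision: self.revision + 1, ..self }
    }
}
```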
> DML-style queries) are timestamped with the timestamp (in the catalog timeline)
> as of which the query was processed. This ensures that these ephemeral
> controller commands are consistent with what the controller must know as of
> that timestamp.
One thing I don't see discussed here is the nondeterminism that can be caused by controller commands. Specifically thinking of how the storage controller today is responsible for minting shard IDs. Is this the moment where we want to declare that everything the controller does should be a perfectly deterministic function of its inputs?
I'm 100% of the opinion that having one durable store that you write to atomically is easier to reason about than having multiple of those (the current situation). And I think this might be required to do 0dt upgrades.

I'm less certain we "really" need it here already: the controller makes the `GlobalId -> ShardId` mapping definite before it returns from an operation.

This assumes, though, that the mapping (or anything else, really) that the controller writes down cannot change. I feel this is similar to what you outlined in #23543 (comment). Then again, though, it seems we also need a revision-like mechanism if we ever want to allow changing the mapping (or any other controller state we keep in shared durable storage).
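A hypothetical sketch of "makes the mapping definite before it returns": mint a shard ID only if the durable mapping doesn't already contain one, and record it before acknowledging the operation. Names and types are invented for illustration:

```rust
use std::collections::BTreeMap;

// Hypothetical types.
struct GlobalId(u64);
#[derive(Clone)]
struct ShardId(String);

struct DurableMapping {
    shards: BTreeMap<u64, ShardId>,
}

impl DurableMapping {
    /// Nondeterministic minting is fine as long as the result is recorded
    /// durably (and atomically) before anyone else can observe the collection.
    /// In a single-durable-store world this write would be part of the same
    /// atomic catalog write.
    fn ensure_shard(&mut self, GlobalId(id): GlobalId, mint: impl Fn() -> ShardId) -> ShardId {
        self.shards.entry(id).or_insert_with(mint).clone()
    }
}
```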
> and therefore would not survive a restart of `environmentd`.
>
> For the first category, the Coordinator acts as a deterministic translator of
> Catalog contents or changes to the Catalog to Controller commands.
When you say "deterministic", does that only assume the catalog state as an input? Speaking for Compute, at least the various dataflow timestamps (`as_of`, `until`) are not derived from the catalog but from the persist frontiers of dependency (and sometimes dependent) collections.
I had the catalog in mind, yes, but your comment is valid! Even today, the state of frontiers is not deterministic; it depends on the ordering in which the coordinator receives upgrades from (storage) clusters and applies since downgrades. So I'd say the approach still works even when controllers are each driven around in their own task (or process, in the long term). What do you think?
What I'm unsure about is to what extent adapter needs to be able to select specific `as_of`s for dataflows created from catalog state.

- It definitely needs to select `as_of`s for `SELECT`s and `SUBSCRIBE`s, but these are ephemeral commands and not derived from the catalog state anyway.
- It needs to select `as_of`s for `REFRESH` MVs too. I'm not entirely sure, but I think the `REFRESH` options are stored in the catalog in some form, so we can derive the `as_of` from those.
- Apart from that, I think adapter mostly just selects the `as_of` as the least possible read time across the inputs and doesn't really care about the specific value, so for those cases we could just move `as_of` selection to the controller (a simplified sketch follows below).

Something that might be annoying is timestamp selection during bootstrap, as that has to take into account the frontiers of dependencies but also of dependent materialized views (see `bootstrap_index_as_of`). Moving that to the controller means that the controller can't immediately forward "create dataflow" commands to its replicas, but has to wait until it has seen the entire catalog snapshot, to make sure that `as_of`s are adjusted correctly.
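To illustrate the "least possible read time" point and the bootstrap wrinkle, a much-simplified sketch using plain `u64` timestamps instead of frontiers and ignoring clamping details. None of this is the actual coordinator code:

```rust
/// Default as_of selection: the least readable time across the dataflow's
/// inputs, i.e. no earlier than any input's since frontier.
fn default_as_of(input_sinces: &[u64]) -> u64 {
    input_sinces.iter().copied().max().unwrap_or(0)
}

/// During bootstrap the as_of is additionally held back so that dependent
/// materialized views can still read the dataflow's output
/// (cf. `bootstrap_index_as_of`); clamping against the input sinces is elided.
fn bootstrap_as_of(input_sinces: &[u64], dependent_mv_uppers: &[u64]) -> u64 {
    let as_of = default_as_of(input_sinces);
    dependent_mv_uppers
        .iter()
        .copied()
        .min()
        .map_or(as_of, |upper| as_of.min(upper))
}
```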
> spawned by the `INSERT` are worked off. And the fact that Catalog state (and
> other state, including talking to Controller(s)) can only be modified from one
> "thread" makes it so that the Peek will observe the correct state when it is
> being processed.
Is it also important for controller responses to have some specific relative priority or does it only matter for internal/external commands?
I think it's not important, because the order in which we receive them today is already not deterministic.
That's also my intuition, but I got myself confused by this earlier statement:

> As we will see later, an important part of the current design is that these
> messages have priorities. The coordinator has to work off all Internal Commands
> (1.) before processing Controller Responses (2.), and so on for 2. and 3.
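For concreteness, a toy sketch of the priority order quoted above (internal commands before controller responses before client commands). The queue and message types are invented; this is not the coordinator's actual select loop:

```rust
use std::collections::VecDeque;

enum Message {
    Internal(String),
    ControllerResponse(String),
    Client(String),
}

struct EventLoop {
    internal: VecDeque<Message>,
    controller_responses: VecDeque<Message>,
    client: VecDeque<Message>,
}

impl EventLoop {
    /// Pick the next message to work off, honoring the priority order:
    /// 1. internal commands, 2. controller responses, 3. client commands.
    fn next_message(&mut self) -> Option<Message> {
        if let Some(msg) = self.internal.pop_front() {
            return Some(msg);
        }
        if let Some(msg) = self.controller_responses.pop_front() {
            return Some(msg);
        }
        self.client.pop_front()
    }
}
```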
> (other than the sequencing of the event loop) of ensuring consistency.
> - Deterministic Controller Commands don't _have_ to be delivered by the
>   Coordinator. Controller(s) could learn about them directly from listening to
>   Catalog changes.
This might be true if we can leave timestamp selection for DDL (creating indexes or MVs) to the controller. The controller has access to all the dependency frontier information, so it would be able to invent as-ofs. I'm just not sure if there is a reason the coordinator needs to be able to control these times.
Alternatively, this might be true if the selected dataflow timestamps were part of the catalog state.
> Once the controllers are decoupled from the Coordinator via the CATALOG pTVC,
> we can start moving them out into their own processes or move them alongside
> the cluster/replica processes. Similarly, we can start moving query processing
> to other processes.
Is my understanding correct that in this end state we will still have only a single controller per cluster/replica? I.e., we don't need to worry about replicas being able to talk to different controllers at the same time?
Yes! But I haven't yet fully worked out where that controller will live. The tricky thing is that there can be multiple replicas, and they each can comprise multiple processes.

Couple options:

- A separate `controllerd` process? That would be easiest to reason about. But cloud might not like that.
- It lives inside one specific replica process. Feels icky!
- It's a distributed thing that lives in all replica processes and internally, somehow, does coordination.

This is one of the last things for Phase III (fully distributed) that I really worry about, because I don't have a good answer yet. 🙈 Good thing we don't need to think about it for Phase II because we still have that one central `environmentd` process...
One thing that's maybe helpful to think about is how badly we need the controllers to be healthy in the Phase III world. If we let the controller run in a replica process, or in all replica processes, it will become slow when the replicas are overloaded, and unavailable when the replicas are OOMing (or the cluster has no replicas?).
Rendered

Motivation

Part of MaterializeInc/database-issues#6316

Checklist

- If this PR evolves an existing `$T ⇔ Proto$T` mapping (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.