
feat(cluster): support system-managed cluster #17051

Open · wants to merge 39 commits into base: main
Conversation

@zhang2014 (Member) commented Dec 15, 2024

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

feat(cluster): support system-managed cluster

Adds new SQL statements:

SHOW WAREHOUSES

SHOW ONLINE NODES

CREATE WAREHOUSE <warehouse> [(ASSIGN <node_size> NODES [FROM <node_group>] [, ...])] WITH [warehouse_size = <warehouse_size>]

DROP WAREHOUSE <warehouse>

RENAME WAREHOUSE <warehouse> TO <new_warehouse>

RESUME WAREHOUSE <warehouse>

SUSPEND WAREHOUSE <warehouse>

INSPECT WAREHOUSE <warehouse>

ALTER WAREHOUSE <warehouse> ADD CLUSTER <cluster> [(ASSIGN <node_size> NODES [FROM <node_group>] [, ...])] WITH [cluster_size = <cluster_size>]

ALTER WAREHOUSE <warehouse> DROP CLUSTER <cluster>

ALTER WAREHOUSE <warehouse> RENAME CLUSTER <cluster> TO <new_cluster>

ALTER WAREHOUSE <warehouse> ASSIGN NODES ( ASSIGN <node_size> NODES [FROM <node_group>] FOR <cluster> [, ...] )

ALTER WAREHOUSE <warehouse> UNASSIGN NODES ( UNASSIGN <node_size> NODES [FROM <node_group>] FOR <cluster> [, ...] )
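For illustration, a hypothetical session exercising the grammar above might look like this (the warehouse, cluster, and node-group names are invented, not taken from the PR):

```sql
CREATE WAREHOUSE wh1 (ASSIGN 2 NODES FROM group_a) WITH warehouse_size = 2;
ALTER WAREHOUSE wh1 ADD CLUSTER c1 (ASSIGN 1 NODES FROM group_a) WITH cluster_size = 1;
ALTER WAREHOUSE wh1 RENAME CLUSTER c1 TO c2;
SUSPEND WAREHOUSE wh1;
RESUME WAREHOUSE wh1;
SHOW WAREHOUSES;
DROP WAREHOUSE wh1;
```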

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):


@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Dec 15, 2024
what-the-diff bot commented Dec 15, 2024

PR Summary

  • Expanded Exception Codes
    New error codes were introduced to better handle specific situations that could arise during warehouse operations. These cover scenarios where no resources are available, where there's an attempt to create a duplicate warehouse or operate on an unknown warehouse, and where there's a conflict in warehouse operations.

  • Re-designed Cluster Management
    This is a significant overhaul of the initial design, replacing all references to ClusterApi and ClusterMgr with WarehouseApi and WarehouseMgr. The adapted interface ensures that the code now revolves around warehouse operations.

  • Removal of cluster_mgr.rs File
    In line with the refactoring process, it was necessary to discard the cluster_mgr.rs file, indicating a shift from cluster management to warehouse management in operations.

  • Introduction of Warehouse-related Functionalities
    To improve warehouse management, new methods were introduced, including the ability to create and delete warehouses.

  • Quality Assurance Processes
    New tests have been set up to validate warehouse operations, ensuring that the creation process works correctly and handles edge cases like attempting to create duplicate warehouses or operating when there are no resources available.

  • Modification of Node Lifecycles
    Test methods were altered to fit the new WarehouseMgr, ensuring that node lifecycles and their interactions are handled appropriately under the new design.

  • Updated API Interaction
    The interfaces with the ClusterDiscovery struct and its related methods were updated to now reference WarehouseApi and WarehouseMgr as part of the refactoring efforts.

  • Updated Function Signatures
    Functions such as create_provider and start_heartbeat were modified to match the newly designed WarehouseApi, and an additional seq parameter was added to start_heartbeat for broader use.

  • Advanced Heartbeat Logic
    Improvements to heartbeat sequence management make the process more efficient.

  • Enhanced Error Handling
    The response mechanism following a heartbeat call was improved, refining the update process for match_seq in error scenarios.

@zhang2014 zhang2014 changed the title feat(cluster): support custom management cluster feat(cluster): support system-managed cluster Dec 20, 2024
@zhang2014 zhang2014 force-pushed the feat/managment_cluster branch from 302db4a to c3af3db Compare December 20, 2024 07:27
@drmingdrmer (Member) left a comment


The naming seems a bit inconsistent with the keys used in the meta-service and the corresponding variables. Aligning them might improve clarity.

Reviewed 1 of 9 files at r1, 10 of 61 files at r2.
Reviewable status: 11 of 62 files reviewed, 8 unresolved discussions (waiting on @zhang2014)


src/query/management/src/warehouse/warehouse_api.rs line 22 at r2 (raw file):

#[derive(serde::Serialize, serde::Deserialize, Clone, Eq, PartialEq, Debug)]
pub enum SelectedNode {
    Random(Option<String>),

What does the string in it mean? If the code isn't self-explanatory, it deserves a documentation comment.


src/query/management/src/warehouse/warehouse_api.rs line 29 at r2 (raw file):

#[derive(serde::Serialize, serde::Deserialize, Eq, PartialEq, Debug)]
pub enum WarehouseInfo {
    SelfManaged(String),

Need doc to explain the meaning of this string.


src/query/management/src/warehouse/warehouse_mgr.rs line 87 at r2 (raw file):

                escape_for_key(tenant)?,
            ),
        })

I'm a bit confused by the field names; they seem inconsistent with the comments. GPT produced a refined version of this part — does it make sense?

It would be clearer to provide an explanation of the key format used under these prefixes.

Suggestion:

        Ok(WarehouseMgr {
            metastore,
            lift_time,
            // Prefix for all online nodes of the tenant
            online_nodes_key_prefix: format!(
                "{}/{}/online_nodes",
                WAREHOUSE_API_KEY_PREFIX,
                escape_for_key(tenant)?
            ),
            // Prefix for all online computing clusters of the tenant
            online_clusters_key_prefix: format!(
                "{}/{}/online_clusters",
                WAREHOUSE_API_KEY_PREFIX,
                escape_for_key(tenant)?
            ),
            // Prefix for all warehouses of the tenant (must ensure compatibility across all versions)
            warehouses_key_prefix: format!(
                "{}/v1/{}",
                WAREHOUSE_META_KEY_PREFIX,
                escape_for_key(tenant)?,
            ),
        })

src/query/management/src/warehouse/warehouse_mgr.rs line 138 at r2 (raw file):

        let mut txn = TxnRequest::default();
        let node_key = format!("{}/{}", self.node_key_prefix, escape_for_key(&node.id)?);

This should be replaced with self.node_key(&node).


src/query/management/src/warehouse/warehouse_mgr.rs line 148 at r2 (raw file):

        ));

        let warehouse_node_key = self.cluster_key(&node)?;

The variable name warehouse_node_key is somewhat inconsistent with the method on the right-hand side, cluster_key().


src/query/management/src/warehouse/warehouse_mgr.rs line 154 at r2 (raw file):

            self.warehouse_key_prefix,
            escape_for_key(&node.warehouse_id)?
        );

introduce another key-building method for this?

Code quote:

        let warehouse_info_key = format!(
            "{}/{}",
            self.warehouse_key_prefix,
            escape_for_key(&node.warehouse_id)?
        );

src/query/management/src/warehouse/warehouse_mgr.rs line 157 at r2 (raw file):

        node.cluster_id = String::new();
        node.warehouse_id = String::new();

This feels a bit odd: at the beginning of this method, it asserts that these two fields are non-empty, but later they are cleared.

Would it make sense to pass cluster_id and warehouse_id as separate arguments to this method instead?

Code quote:

        node.cluster_id = String::new();
        node.warehouse_id = String::new();

src/meta/types/src/cluster.rs line 100 at r2 (raw file):

    pub warehouse_id: String,

    pub runtime_resource_group: Option<String>,

This field does not need to specify #[serde(skip_serializing_if = "Option::is_none")]?

Encoding a null value may break older version query?

And for the three new fields, is #[serde(skip)] needed to allow it to decode an older version NodeInfo json that does not have these fields?

@zhang2014 (Member, Author)

src/meta/types/src/cluster.rs line 100 at r2 (raw file):

Previously, drmingdrmer (张炎泼) wrote…

This field does not need to specify #[serde(skip_serializing_if = "Option::is_none")]?

Encoding a null value may break older version query?

And for the three new fields, is #[serde(skip)] needed to allow it to decode an older version NodeInfo json that does not have these fields?

We do not need to consider compatibility issues with older versions. The prefix has been modified, and now the new version will be written under the new prefix.

@zhang2014

src/query/management/src/warehouse/warehouse_api.rs line 22 at r2 (raw file):

Previously, drmingdrmer (张炎泼) wrote…

What does the string in it mean? If the code isn't self-explanatory, it deserves a documentation comment.

Done.

@zhang2014

src/query/management/src/warehouse/warehouse_api.rs line 29 at r2 (raw file):

Previously, drmingdrmer (张炎泼) wrote…

Need doc to explain the meaning of this string.

Done.

@zhang2014

src/query/management/src/warehouse/warehouse_mgr.rs line 87 at r2 (raw file):

Previously, drmingdrmer (张炎泼) wrote…

I'm kind of confused of the field names. A bit inconsistent with the name and the comment. GPT gives a refined version of this part. make sense?

It would be clearer to provide an explanation of the key format used under these prefixes.

Done.

@zhang2014

src/query/management/src/warehouse/warehouse_mgr.rs line 138 at r2 (raw file):

Previously, drmingdrmer (张炎泼) wrote…

This should be replaced with self.node_key(&node).

Done.

@zhang2014

src/query/management/src/warehouse/warehouse_mgr.rs line 148 at r2 (raw file):

Previously, drmingdrmer (张炎泼) wrote…

The variable name wharehouse_node_key is kind of inconsistent with the right side method cluster_key()

Done.

@zhang2014

src/query/management/src/warehouse/warehouse_mgr.rs line 154 at r2 (raw file):

Previously, drmingdrmer (张炎泼) wrote…

introduce another key-building method for this?

Done.

@zhang2014

src/query/management/src/warehouse/warehouse_mgr.rs line 157 at r2 (raw file):

Previously, drmingdrmer (张炎泼) wrote…

This feels a bit odd: at the beginning of this method, it asserts that these two fields are non-empty, but later they are cleared.

Would it make sense to pass cluster_id and warehouse_id as separate arguments to this method instead?

Done. Extracted the unload_warehouse_info function and added comments.

// Unload the warehouse and cluster from the node.
// 1. Used when a node is removed from the cluster.
// 2. For cluster_node_key: node_info, since its warehouse and cluster are already encoded in the key, we do not need to write the warehouse and cluster into its value again.
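The rationale above can be sketched in plain Rust (a minimal sketch with invented names; the PR's actual key-building helpers and NodeInfo type differ): since the warehouse and cluster are already encoded in the cluster-scoped key, the value stored under that key can have those fields cleared before serialization.

```rust
// Hypothetical key builder: warehouse and cluster are part of the key path.
fn cluster_node_key(prefix: &str, warehouse: &str, cluster: &str, node_id: &str) -> String {
    format!("{prefix}/{warehouse}/{cluster}/{node_id}")
}

// Simplified stand-in for the real NodeInfo.
struct NodeInfo {
    id: String,
    warehouse_id: String,
    cluster_id: String,
}

// Clear fields that the key already encodes, so the value need not repeat them.
fn unload_warehouse_info(node: &mut NodeInfo) {
    node.warehouse_id = String::new();
    node.cluster_id = String::new();
}

fn main() {
    let mut node = NodeInfo {
        id: "node_1".into(),
        warehouse_id: "wh1".into(),
        cluster_id: "c1".into(),
    };
    // Build the key first, while the fields are still populated.
    let key = cluster_node_key("tenant/online_clusters", &node.warehouse_id, &node.cluster_id, &node.id);
    unload_warehouse_info(&mut node);
    assert_eq!(key, "tenant/online_clusters/wh1/c1/node_1");
    assert!(node.warehouse_id.is_empty() && node.cluster_id.is_empty());
}
```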

@zhang2014

system-managed warehouse?

@zhang2014

@drmingdrmer Thanks for the very helpful comments. I have resolved the issues; please review again.

@zhang2014 zhang2014 marked this pull request as ready for review December 29, 2024 12:03
@drmingdrmer left a comment


Reviewed 3 of 59 files at r4, 2 of 4 files at r5.
Reviewable status: 9 of 77 files reviewed, 10 unresolved discussions (waiting on @zhang2014)


src/query/management/src/warehouse/warehouse_mgr.rs line 157 at r2 (raw file):

Previously, zhang2014 (Winter Zhang) wrote…

Done. Extracted the unload_warehouse_info function and added comments.

// Unload the warehouse and cluster from the node.
// 1. Used when a node is removed from the cluster.
// 2. For cluster_node_key: node_info, since its warehouse and cluster are already encoded in the key, we do not need to write the warehouse and cluster into its value again.

Have you considered removing cluster_id and warehouse_id from NodeInfo, since they are not actually stored but just used to generate some keys?


src/query/management/src/warehouse/warehouse_api.rs line 29 at r2 (raw file):

Previously, zhang2014 (Winter Zhang) wrote…

Done.

In that case, making it SelfManaged { cluster_id: String } would be more self-explanatory.


src/query/management/src/warehouse/warehouse_mgr.rs line 75 at r5 (raw file):

            // Prefix for all online computing clusters of the tenant
            meta_key_prefix: format!(
                "{}/{}/online_clusters_v2",

The meaning of the meta-key is still unclear: what does "meta" mean here?


src/query/management/src/warehouse/warehouse_mgr.rs line 179 at r5 (raw file):

            serde_json::to_vec(&warehouse_info)?,
            Some(self.lift_time),
        ));

The comment says that the warehouse info is written conditionally, but the code always writes it unconditionally.

The comment should probably be // self-managed warehouse requires persisting warehouse info, or something similar.

Code quote:

        // upsert warehouse info if self-managed.

        txn.if_then.push(TxnOp::put_with_ttl(
            warehouse_info_key.clone(),
            serde_json::to_vec(&warehouse_info)?,
            Some(self.lift_time),
        ));

src/query/management/src/warehouse/warehouse_mgr.rs line 186 at r5 (raw file):

        if seq != MatchSeq::Exact(0) {
            return Ok(self.metastore.transaction(txn).await?);
        }

It's a little bit weird: why not retry if seq != MatchSeq::Exact(0)?

This function's retry behavior is inconsistent. When seq equals MatchSeq::Exact(0), it retries until the update succeeds. However, for any other seq value, it makes no retry attempts. If retry behavior should be optional, it should be controlled by an explicit parameter rather than being tied to the seq value.

Code quote:

        if seq != MatchSeq::Exact(0) {
            return Ok(self.metastore.transaction(txn).await?);
        }
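One way to realize the reviewer's suggestion is to pass the retry behavior as an explicit flag rather than deriving it from the seq value. This is a minimal sketch with invented names (a stubbed transaction stands in for metastore.transaction), not the PR's actual API:

```rust
// Stand-in for a metastore transaction: succeeds once `attempt` reaches a threshold.
fn try_commit(attempt: u32, succeed_on: u32) -> Result<bool, String> {
    Ok(attempt >= succeed_on)
}

// Retry behavior is an explicit parameter, independent of any seq value.
fn commit(retry: bool, succeed_on: u32) -> Result<bool, String> {
    let mut attempt = 0;
    loop {
        let ok = try_commit(attempt, succeed_on)?;
        if ok || !retry {
            // Either the transaction succeeded, or the caller asked for a single attempt.
            return Ok(ok);
        }
        attempt += 1; // retry until the transaction succeeds
    }
}

fn main() {
    // With retry enabled, we keep going until the txn succeeds.
    assert_eq!(commit(true, 3), Ok(true));
    // Without retry, a failed txn is returned to the caller immediately.
    assert_eq!(commit(false, 3), Ok(false));
}
```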

src/query/management/src/warehouse/warehouse_mgr.rs line 200 at r5 (raw file):

            warehouse_txn
                .else_then
                .push(TxnOp::get(warehouse_info_key.clone()));

It looks like this else_then operation of getting the warehouse could be added once, before entering the loop?

Code quote:

            warehouse_txn
                .else_then
                .push(TxnOp::get(warehouse_info_key.clone()));

src/query/management/src/warehouse/warehouse_mgr.rs line 236 at r5 (raw file):

                }
                response => Ok(response),
            };

This piece of logic is deeply nested and quite hard to understand. I asked GPT to rewrite it with less indentation.

Does it reflect the logic correctly?

let mut response = self.metastore.transaction(warehouse_txn).await?;
if response.success {
    return Ok(response);
}

let resp = match response.responses.pop().and_then(|x| x.response) {
    Some(Response::Get(data)) => data,
    _ => return Ok(response),
};

let value = match resp.value {
    None => return Ok(response),
    Some(value) if value.seq == 0 => return Ok(response),
    Some(value) => value,
};

let warehouse_info = serde_json::from_slice(&value.data)?;

match warehouse_info {
    WarehouseInfo::SystemManaged(_) => {
        return Err(ErrorCode::WarehouseAlreadyExists(
            "Already exists same name system-managed warehouse.",
        ));
    }
    WarehouseInfo::SelfManaged(_) => {
        // Check if node already exists
        if let Some(TxnOpResponse {
            response: Some(Response::Get(TxnGetResponse { value: Some(value), .. })),
        }) = response.responses.first()
        {
            if value.seq != 0 {
                return Ok(response);
            }
        }

        log::info!(
            "Self-managed warehouse has already been created by other nodes; attempt to join it. Retry count: {}",
            retry_count
        );
        retry_count += 1;
        exact_seq = value.seq;
        continue;
    }
}

Code quote:

            return match self.metastore.transaction(warehouse_txn).await? {
                mut response if !response.success => {
                    return match response.responses.pop().and_then(|x| x.response) {
                        Some(Response::Get(data)) => match data.value {
                            None => Ok(response),
                            Some(value) if value.seq == 0 => Ok(response),
                            Some(value) => match serde_json::from_slice(&value.data)? {
                                WarehouseInfo::SystemManaged(_) => {
                                    Err(ErrorCode::WarehouseAlreadyExists(
                                        "Already exists same name system-managed warehouse.",
                                    ))
                                }
                                WarehouseInfo::SelfManaged(_) => match response.responses.first() {
                                    // already exists node.
                                    Some(TxnOpResponse {
                                        response:
                                            Some(Response::Get(TxnGetResponse {
                                                value: Some(value),
                                                ..
                                            })),
                                    }) if value.seq != 0 => Ok(response),
                                    _ => {
                                        log::info!("Self-managed warehouse has already been created by other nodes; attempt to join it. Retry count: {}", retry_count);
                                        retry_count += 1;
                                        exact_seq = value.seq;
                                        continue;
                                    }
                                },
                            },
                        },
                        _ => Ok(response),
                    };
                }
                response => Ok(response),
            };

src/query/management/src/warehouse/warehouse_mgr.rs line 321 at r5 (raw file):

                // std::mem::swap(&mut node_info.warehouse_id, &mut warehouse_id);
                Err(err)
            }

Suggestion:

        match upsert_node.await? {

src/query/management/src/warehouse/warehouse_mgr.rs line 327 at r5 (raw file):

                // std::mem::swap(&mut node_info.warehouse_id, &mut warehouse_id);
                Ok(seq)
            }

No need to roll back, right?

Code quote:

        match upsert_node.await {
            Err(err) => {
                // rollback
                // std::mem::swap(&mut node_info.cluster_id, &mut cluster_id);
                // std::mem::swap(&mut node_info.warehouse_id, &mut warehouse_id);
                Err(err)
            }
            Ok(response) if !response.success => {
                // rollback
                // std::mem::swap(&mut node_info.cluster_id, &mut cluster_id);
                // std::mem::swap(&mut node_info.warehouse_id, &mut warehouse_id);
                Ok(seq)
            }

src/query/management/src/warehouse/warehouse_mgr.rs line 444 at r5 (raw file):

                    continue 'retry;
                }
            }

Too many indents... hard to understand.

A rewritten version of this block with less indentation looks like:

let response = self.metastore.transaction(after_txn).await?;
if !response.success {
    continue 'retry;
}

let mut consistent_nodes = Vec::with_capacity(response.responses.len());

for (idx, response) in response.responses.into_iter().enumerate() {
    let get_response = match response.response {
        Some(Response::Get(response)) => response,
        _ => continue 'retry,
    };

    let value = match get_response.value {
        Some(value) => value,
        _ => continue 'retry,
    };

    let node_info = serde_json::from_slice::<NodeInfo>(&value.data)?;
    
    assert_eq!(node_info.warehouse_id, id);
    assert!(!node_info.cluster_id.is_empty());

    consistent_nodes.push(ConsistentNodeInfo {
        node_seq: value.seq,
        cluster_seq: cluster_node_seq[idx],
        node_info,
    });
}

if consistent_nodes.len() != cluster_node_seq.len() {
    continue 'retry;
}

return Ok(ConsistentWarehouseInfo {
    info_seq: before_info.seq,
    warehouse_info: serde_json::from_slice(&before_info.data)?,
    consistent_nodes,
});

Code quote:

            match self.metastore.transaction(after_txn).await? {
                response if response.success => {
                    let mut consistent_nodes = Vec::with_capacity(response.responses.len());
                    for (idx, response) in response.responses.into_iter().enumerate() {
                        match response.response {
                            // TODO: maybe ignore none(not need retry)
                            Some(Response::Get(response)) => match response.value {
                                Some(value) => {
                                    let node_info =
                                        serde_json::from_slice::<NodeInfo>(&value.data)?;

                                    assert_eq!(node_info.warehouse_id, id);
                                    assert!(!node_info.cluster_id.is_empty());

                                    consistent_nodes.push(ConsistentNodeInfo {
                                        node_seq: value.seq,
                                        cluster_seq: cluster_node_seq[idx],
                                        node_info,
                                    });
                                }
                                _ => {
                                    continue 'retry;
                                }
                            },
                            _ => {
                                continue 'retry;
                            }
                        }
                    }

                    if consistent_nodes.len() == cluster_node_seq.len() {
                        return Ok(ConsistentWarehouseInfo {
                            info_seq: before_info.seq,
                            warehouse_info: serde_json::from_slice(&before_info.data)?,
                            consistent_nodes,
                        });
                    }
                }
                _ => {
                    continue 'retry;
                }
            }

@flaneur2020 (Member) commented Dec 30, 2024

Do we have a config to disable these WAREHOUSE statements? We have a cloud deployment with the warehouse concept; until we integrate the WAREHOUSE statements with the underlying Kubernetes primitives, it might not be bad to let us disable them, to avoid concept conflicts in possible corner cases.

@zhang2014

Do we have a config to disable these WAREHOUSE statements? We have a cloud deployment with the warehouse concept; until we integrate the WAREHOUSE statements with the underlying Kubernetes primitives, it might not be bad to let us disable them, to avoid concept conflicts in possible corner cases.

By default (without the resources_management configuration), we return an unimplemented error. Is this enough?

@flaneur2020

By default (without the resources_management configuration), we return an unimplemented error. Is this enough?

SGTM 👍
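The default behavior agreed on above can be sketched as follows (the config field and error variant are illustrative stand-ins, not Databend's actual types): warehouse statements are rejected with an unimplemented error unless a resources-management backend is configured.

```rust
// Hypothetical error type; Databend's real ErrorCode differs.
#[derive(Debug, PartialEq)]
enum ErrorCode {
    Unimplemented(&'static str),
}

// Hypothetical config: `resources_management` names the backend, if any.
struct Config {
    resources_management: Option<String>,
}

// Gate WAREHOUSE statements on the presence of a resources-management backend.
fn check_warehouse_statement(cfg: &Config) -> Result<(), ErrorCode> {
    match &cfg.resources_management {
        Some(_) => Ok(()),
        None => Err(ErrorCode::Unimplemented(
            "WAREHOUSE statements require the resources_management configuration",
        )),
    }
}

fn main() {
    // Without the configuration, the statement is rejected.
    let disabled = Config { resources_management: None };
    assert!(check_warehouse_statement(&disabled).is_err());

    // With a backend configured, the statement is allowed through.
    let enabled = Config { resources_management: Some("system".to_string()) };
    assert!(check_warehouse_statement(&enabled).is_ok());
}
```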

zhang2014 and others added 3 commits December 30, 2024 13:24
### Improvements:

- Added retry mechanism with a fallback in the retry loop.
- Return an error if an unexpected response is received when `TxnGetResponse` is expected.
- Refined quit-retry condition: now only triggered when the seq of `NodeInfo` changes.

### Refactoring:

- Simplified and decoupled nested branching for better readability and maintainability.
- Consolidated related logic, e.g., building `txn if_then` operations in a single place.
- Differentiated `NodeInfo` with and without warehouse-related information.

### Documentation:
- Added details explaining behavioral differences between insert and update modes.
refactor: improve and clean up `warehouse_mgr::upsert_self_managed()`
3 participants