This repository has been archived by the owner on Oct 12, 2023. It is now read-only.

[Feature] Add HuaweiCloud provider #1011

Open
kiwik opened this issue Dec 16, 2022 · 49 comments

Comments

@kiwik
Contributor

kiwik commented Dec 16, 2022

It would be great if CloudTik could support a HuaweiCloud provider. We would like to use CloudTik to launch Spark/ML clusters with OAP enhancements on HuaweiCloud, and I can help implement it.

https://www.huaweicloud.com/

@jerrychenhf
Contributor

Hi @kiwik, it's great that you can contribute to CloudTik.
To implement support for HuaweiCloud, you need to implement:

  1. the NodeProvider: the functions for creating and terminating nodes with tags.
  2. the WorkspaceProvider: the provisioning of workspace resources (the VPC, identities and roles, cloud storage) given a workspace name.

You can submit a PR once you have some code.
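
As a rough illustration of the two pieces listed above, the interfaces could be sketched as follows. The class and method names are hypothetical placeholders, not CloudTik's actual base classes; the existing providers in the repository show the real signatures. The in-memory class exists only to make the sketch runnable.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class NodeProviderSketch(ABC):
    """Hypothetical node-level interface: create and terminate tagged nodes."""

    @abstractmethod
    def create_node(self, node_config: Dict, tags: Dict[str, str], count: int) -> List[str]:
        """Create `count` nodes carrying `tags` and return their ids."""

    @abstractmethod
    def terminate_node(self, node_id: str) -> None:
        """Terminate a single node."""

    @abstractmethod
    def non_terminated_nodes(self, tag_filters: Dict[str, str]) -> List[str]:
        """Return ids of live nodes whose tags match `tag_filters`."""


class WorkspaceProviderSketch(ABC):
    """Hypothetical workspace-level interface keyed by a workspace name."""

    @abstractmethod
    def create_workspace(self, workspace_name: str) -> None:
        """Provision network, identities and roles, and cloud storage."""

    @abstractmethod
    def delete_workspace(self, workspace_name: str) -> None:
        """Tear down everything created for the workspace."""


class InMemoryNodeProvider(NodeProviderSketch):
    """Toy implementation that keeps nodes in a dict, for illustration only."""

    def __init__(self) -> None:
        self._nodes: Dict[str, Dict[str, str]] = {}
        self._next_id = 0

    def create_node(self, node_config, tags, count):
        ids = []
        for _ in range(count):
            node_id = f"node-{self._next_id}"
            self._next_id += 1
            self._nodes[node_id] = dict(tags)
            ids.append(node_id)
        return ids

    def terminate_node(self, node_id):
        self._nodes.pop(node_id, None)

    def non_terminated_nodes(self, tag_filters):
        # A node matches when every requested tag has the requested value.
        return [
            node_id
            for node_id, tags in self._nodes.items()
            if all(tags.get(key) == value for key, value in tag_filters.items())
        ]
```

Tags are what let the cluster find its own nodes later, which is why the listing call filters on them.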

@kiwik
Contributor Author

kiwik commented Jan 5, 2023

@jerrychenhf thank you for pointing out the key parts. I have only just started reading the CloudTik documentation and have not begun design or coding yet. I will discuss it with my colleagues and may submit a design in the coming months.

Our team is interested in Big Data and AI technologies and enhancements, and we could cooperate with the oap-project team on several open source projects.

@kiwik
Contributor Author

kiwik commented Jan 13, 2023

Task list:

  • Huaweicloud credentials
    • Update workspace-schema.json
    • Check login and testing
  • Update WorkspaceProvider
    • Huaweicloud ECS Client
    • VPC API
    • Subnet API
    • ...
  • Update NodeProvider
    • Create Node
    • Terminate Node
    • Update XXX API
    • Info API
  • Templates and examples

@jerrychenhf
Contributor

@kiwik Great to hear that!

As to the tasks, that's exactly the right sequence. You can start by implementing workspace create, delete and status, which cover the VPC design (VPC and subnets, firewalls, ... with or without public IP, with VPC peering or not), identity and roles for instance authentication and authorization, and managed cloud storage for the workspace.

Once the workspace is implemented, the next step is the Node Provider: creating and deleting instances, tagging, getting instance information and so on.

For Spark (and other workloads) to access the cloud storage, there is some lightweight implementation in the runtime configuration steps, but this can be left to the last step.

CloudTik also supports a K8S provider with cloud integration (mostly OIDC provider integration for identity and roles). If Huawei Cloud has a K8S engine that integrates with Huawei Cloud resources, an integration layer can be developed for the Cloud Kubernetes Provider.

@kiwik
Contributor Author

kiwik commented Feb 6, 2023

Thank you @jerrychenhf for adding so many details; it helps make the whole workflow clear.

As a high-level plan, I have started implementing the HuaweiCloud ECS provider first, which is based on virtual machines. HuaweiCloud also offers a K8S engine, the CCE service, and I will implement the CCE provider after the ECS provider is ready.

I will commit the workspace-related functions for the ECS provider this week. I plan to split the whole HuaweiCloud provider code into a series of patch sets that each focus on one feature, such as the workspace, the node provider and so on. Hopefully, these small PRs will make code review a little easier.

kiwik added a commit to kiwik/cloudtik that referenced this issue Feb 7, 2023
1. Create and delete workspace networking resources
2. Add HUAWEICLOUD SDK package into setup.py and requirements.txt
3. Add HUAWEICLOUD default config files, and update schema.

Related-with: oap-project#1011
kiwik added a commit to kiwik/cloudtik that referenced this issue Feb 9, 2023
1. Create and delete workspace networking resources and cloud storage.
2. Add HUAWEICLOUD SDK package into setup.py and requirements.txt.
3. Add HUAWEICLOUD default config files, and update schema.

Related-with: oap-project#1011
kiwik added a commit to kiwik/cloudtik that referenced this issue Feb 10, 2023
1. Create and delete workspace networking resources and cloud storage.
2. Add HUAWEI CLOUD SDK package into setup.py and requirements.txt.
3. Add HUAWEI CLOUD default config files, and update schema.

Related-with: oap-project#1011
kiwik added a commit to kiwik/cloudtik that referenced this issue Feb 15, 2023
1. Create and delete workspace networking resources and cloud storage.
2. Add HUAWEI CLOUD SDK package into setup.py and requirements.txt.
3. Add HUAWEI CLOUD default config files, and update schema.

Related-with: oap-project#1011
jerrychenhf pushed a commit that referenced this issue Feb 16, 2023
1. Create and delete workspace networking resources and cloud storage.
2. Add HUAWEI CLOUD SDK package into setup.py and requirements.txt.
3. Add HUAWEI CLOUD default config files, and update schema.

Related-with: #1011
@jerrychenhf
Contributor

jerrychenhf commented Feb 16, 2023

@kiwik The plan looks great! The workspace functions have been committed.
The next step is the node provider implementation, which is the core of bringing the cluster up and running on Huawei Cloud.

While you are implementing it, I will be able to help with some aspects when I free up. A few notes to help with this part:

  1. The key connection between the workspace and the cluster (node provider) is the bootstrap process, node_provider.bootstrap_config. The bootstrap_config process configures all the necessary parameters from the workspace and cluster options into node_config for each node type (head and worker). An ideal design is for the parameters under node_config to form the same config map that the Huawei Cloud run_instances API (or a similarly named API) accepts. The bootstrap information includes:
    a. Set the right IAM role, based on the workspace, for the instance (head or worker) in node_config.
    b. Generate and create the key pair for the cluster if the user doesn't specify one, and set the key pair name in node_config.
    c. Set the managed cloud storage information of the workspace in the cluster config under the storage section.
    d. Set the VPC and subnet information of the workspace in node_config.
    e. Set the security group id of the workspace in node_config for creating instances.
    f. Set the latest ubuntu_20_04 image as the image id in node_config if the user doesn't specify one.
    g. Set the spot preference in node_config based on the common option in the provider, if the user doesn't override it in node_config.
    You can refer to an existing provider's bootstrap implementation and modify it.
  2. The node_provider implementation. Assume that node_config has almost every option needed to create an instance. Add tag handling. You can start with an existing provider and modify it; it will be straightforward.
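
The bootstrap step described above could look roughly like the sketch below. Every key name (`iam_role`, `default_key_name`, `use_spot`, and so on) and the function name are hypothetical placeholders chosen for illustration; the real keys should follow whatever the Huawei Cloud instance-creation API accepts.

```python
import copy
from typing import Dict


def bootstrap_node_config_sketch(cluster_config: Dict, workspace: Dict) -> Dict:
    """Copy workspace-derived settings into each node type's node_config.

    All key names here are hypothetical placeholders.
    """
    # Work on a copy so the caller's configuration is left untouched.
    config = copy.deepcopy(cluster_config)

    # Managed cloud storage goes into the cluster-level storage section.
    config.setdefault("storage", {})["bucket"] = workspace["bucket"]

    for node_type in config["available_node_types"].values():
        node_config = node_type.setdefault("node_config", {})

        # Workspace-owned resources are always taken from the workspace.
        node_config["iam_role"] = workspace["iam_role"]
        node_config["vpc_id"] = workspace["vpc_id"]
        node_config["subnet_id"] = workspace["subnet_id"]
        node_config["security_group_id"] = workspace["security_group_id"]

        # Key pair, image and spot preference are only defaults:
        # a value the user set explicitly in node_config wins.
        node_config.setdefault("key_name", workspace["default_key_name"])
        node_config.setdefault("image_ref", workspace["default_image_ref"])
        node_config.setdefault(
            "use_spot", config["provider"].get("use_spot", False)
        )
    return config
```

Using `setdefault` for the optional values keeps the "only if the user doesn't specify one" rule in a single place.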

I look forward to your next patch for this. I can help by starting on the Spark runtime and fuse support for Huawei cloud storage.

One best practice is to follow an existing node provider (the one closest to the HuaweiCloud API). Avoid changes without a specific reason, so that the new implementation is less likely to break any assumptions.

@jerrychenhf
Contributor

@kiwik I merged the code that integrates OBS storage into Hadoop for Huawei Cloud. #1138

I found a few issues in the Hadoop Huawei project (https://github.com/huaweicloud/obsa-hdfs):
huaweicloud/obsa-hdfs#14

I made some modifications to the source code and compiled a Hadoop 3.3.1 version of the hadoop-huawei jar.
Maybe you can push the Huawei team to fix the issue in the repo.

@jerrychenhf
Contributor

@kiwik I committed a PR for the Huawei Cloud node provider that integrates the runtime storage configurations. #1139

Please take a look. You need to implement the remaining node provider methods for instance operations. (See the TODO in node_provider and the TODO in bootstrap_huaweicloud.)

@jerrychenhf
Contributor

jerrychenhf commented Feb 16, 2023

@kiwik PR #1140 made some improvements to make_xxx_client so that the credential handling logic is shared and can be improved more easily in the future.
One additional note:
I added an optional region parameter for creating a client in a region other than the one in the provider config. This is needed for the _get_current_vpc call on the working node (instance), because the working node's region may differ from the region set in the provider (the workspace region) when VPC peering is used. In that case the working node's VPC is in a different region than the workspace VPC (this is allowed), and peering is established between the two VPCs in different regions (does Huawei Cloud allow this? I think so).

So in the following code:

```python
def _get_current_vpc(config, vpc_client=None):
    vm_loca_ip_url = HWC_VM_METADATA_URL + 'local-ipv4'
    response = requests.get(vm_loca_ip_url)
    vm_local_ip = response.text
    ecs_client = make_ecs_client(config)
    response = ecs_client.list_servers_details(
        ListServersDetailsRequest(ip=vm_local_ip))
```

ecs_client = make_ecs_client(config) should be ->

ecs_client = make_ecs_client(config, region_of_working_node)
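
A minimal sketch of the optional-region idea, assuming a hypothetical factory that returns a plain dict instead of a real SDK client. The function names and the `provider.region` config key are illustrative, and the second region name is just an example value.

```python
from typing import Dict, Optional


def resolve_client_region(config: Dict, region: Optional[str] = None) -> str:
    """Return the explicit region if one is given, else the provider's region."""
    return region or config["provider"]["region"]


def make_ecs_client_sketch(config: Dict, region: Optional[str] = None) -> Dict:
    """Stand-in for a client factory; a real one would build an SDK client."""
    return {"service": "ecs", "region": resolve_client_region(config, region)}


config = {"provider": {"region": "ap-southeast-3"}}

# Default: the client targets the workspace (provider) region.
workspace_client = make_ecs_client_sketch(config)

# Override: the client targets the region the working node actually runs in.
working_node_client = make_ecs_client_sketch(config, "cn-north-4")
```

Keeping the region optional means existing call sites keep working unchanged, and only the working-node lookup has to pass a region.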

@kiwik
Contributor Author

kiwik commented Feb 17, 2023

I pinged the Huawei Cloud obsa-hdfs maintainer and hope he can help. ^

It is very kind of you to add the framework of the node provider; it gives me an entry point to work through the whole code path.
I will continue to implement the PR after rebasing on your commits. Thank you very much.

@jerrychenhf
Contributor

@kiwik That's great!
I was looking into the fuse mount feature as a forward-looking item.
I found obsfs (https://support.huaweicloud.com/fstg-obs/obs_12_0001.html). It mentions "obsfs only supports mounting OBS parallel file systems; it does not support mounting object storage buckets." What does this mean? Can we use it with the current managed OBS bucket you create in CloudTik?

Mounting a cloud file system locally is a very useful feature, especially for ML/DL cases.

Additionally, obsfs seems to be quite old and doesn't mention support for newer Linux versions such as Ubuntu 20.04 (which we use).

@kiwik
Contributor Author

kiwik commented Feb 17, 2023

So the working node's VPC is in a different region than the workspace VPC (this is allowed), and peering is established between the two VPCs in different regions (does Huawei Cloud allow this? I think so).

It's different in Huawei Cloud. VPC Peering in Huawei Cloud only supports connecting VPCs in the same region, while Cloud Connect is the concept for connecting VPCs across regions; see the references below. They use different APIs and SDKs, so I prefer to keep it simple for the first CloudTik release with Huawei Cloud support: support VPC Peering within the same region for now, and add cross-region Cloud Connect in the future. Maybe we can add a description of this limitation to the CloudTik documentation for the Huawei Cloud provider, something like "Huawei Cloud only supports VPC Peering between different VPCs in the same region". What do you think?

VPC Peering: https://support.huaweicloud.com/usermanual-vpc/zh-cn_topic_0046655036.html
Cloud Connect: https://support.huaweicloud.com/function-cc/index.html
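
One way to surface this limitation early would be a validation check before any peering is attempted. This is only a sketch: the function name is hypothetical, and where such a check would live in CloudTik is an assumption.

```python
def check_same_region_peering(workspace_region: str, working_node_region: str) -> None:
    """Reject a peering setup whose two VPCs are in different regions.

    Hypothetical helper: it only compares region names and makes no API call.
    """
    if workspace_region != working_node_region:
        raise ValueError(
            "VPC Peering requires both VPCs to be in the same region, got "
            f"{workspace_region!r} and {working_node_region!r}"
        )


# Same region: passes silently.
check_same_region_peering("ap-southeast-3", "ap-southeast-3")
```

Failing fast with a clear message is friendlier than letting the peering request fail deep inside the workspace creation.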

@kiwik
Contributor Author

kiwik commented Feb 17, 2023

It mentions "obsfs only supports mounting OBS parallel file systems; it does not support mounting object storage buckets." What does this mean? Can we use it with the current managed OBS bucket you create in CloudTik?

You are right. I should update the _check_and_create_cloud_storage_bucket function to add the HTTP request header parameter that enables the OBS parallel file system in the create bucket API. Sorry, my fault, I forgot to add it; I will update it in a following PR.

Reference: https://support.huaweicloud.com/api-obs/obs_04_0021.html
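
A sketch of how the extra request header might be assembled. The header name and value used here (`x-obs-fs-file-interface: Enabled`) are an assumption to be checked against the API reference linked above, and the helper name is hypothetical; how the header is passed on to the SDK call is not shown.

```python
from typing import Dict

# Assumed header for creating a parallel file system instead of a plain
# bucket; verify the exact name and value against the OBS API reference.
PFS_HEADER = "x-obs-fs-file-interface"


def build_create_bucket_headers(enable_parallel_fs: bool) -> Dict[str, str]:
    """Return the extra headers for a create-bucket request (hypothetical helper)."""
    headers: Dict[str, str] = {}
    if enable_parallel_fs:
        headers[PFS_HEADER] = "Enabled"
    return headers


pfs_headers = build_create_bucket_headers(enable_parallel_fs=True)
```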

@kiwik
Contributor Author

kiwik commented Feb 17, 2023

@jerrychenhf ^ a quick fix so that your work is not blocked.

@jerrychenhf
Contributor

So the working node's VPC is in a different region than the workspace VPC (this is allowed), and peering is established between the two VPCs in different regions (does Huawei Cloud allow this? I think so).

It's different in Huawei Cloud. VPC Peering in Huawei Cloud only supports connecting VPCs in the same region, while Cloud Connect is the concept for connecting VPCs across regions; see the references below. They use different APIs and SDKs, so I prefer to keep it simple for the first CloudTik release with Huawei Cloud support: support VPC Peering within the same region for now, and add cross-region Cloud Connect in the future. Maybe we can add a description of this limitation to the CloudTik documentation for the Huawei Cloud provider, something like "Huawei Cloud only supports VPC Peering between different VPCs in the same region". What do you think?

VPC Peering: https://support.huaweicloud.com/usermanual-vpc/zh-cn_topic_0046655036.html
Cloud Connect: https://support.huaweicloud.com/function-cc/index.html

I see. Thanks!

@jerrychenhf
Contributor

jerrychenhf commented Feb 17, 2023

@jerrychenhf ^ a quick fix so that your work is not blocked.
Thanks @kiwik!

jerrychenhf pushed a commit that referenced this issue Feb 17, 2023
@kiwik
Contributor Author

kiwik commented Mar 22, 2023

Test env: openEuler 20.03 LTS SP3 OS on Arm64 HuaweiCloud VM
CloudTik config yaml: example/cluster/huaweicloud/example-workspace.yaml
CloudTik code base: built from source, 1.0.7-dev, commit a849f67

Test cases (commands):

  • cloudtik workspace create
  • cloudtik workspace delete
  • cloudtik workspace status
  • cloudtik workspace update-firewalls
  • cloudtik workspace info
  • cloudtik workspace show-clusters
  • cloudtik start
  • cloudtik info
  • cloudtik status
  • cloudtik attach
  • cloudtik exec
  • cloudtik rsync-up
  • cloudtik rsync-down
  • cloudtik submit
  • cloudtik monitor
  • cloudtik health-check
  • cloudtik debug-status
  • cloudtik process-status

kiwik added a commit to kiwik/cloudtik that referenced this issue Mar 23, 2023
1. Enable OBSClient security provider policy chain, it can work
with ENV "OBS_ACCESS_KEY_ID" and "OBS_SECRET_ACCESS_KEY" or
ECS agent to get AK/SK automatically, it's disable by default.
2. Fix cloud.storage.uri return a whole OBS bucket URI for
command "cloudtik workspace info"

Related-with: oap-project#1011
kiwik added a commit to kiwik/cloudtik that referenced this issue Mar 23, 2023
1. Enable OBSClient security provider policy chain, it can work
with ENV "OBS_ACCESS_KEY_ID" and "OBS_SECRET_ACCESS_KEY" or
ECS agent to get AK/SK automatically, it's disable by default.
2. Fix cloud.storage.uri return a whole OBS bucket URI for
command "cloudtik workspace info"
3. Fix ECS create server error for command "cloudtik start",
remove unnecessary item in dict of create_server.

Related-with: oap-project#1011
kiwik added a commit to kiwik/cloudtik that referenced this issue Mar 25, 2023
1. "ap-southeast-3" region is in Singapore, we can use more stable
and speed networking access to some resources.
2. Update server flavor and image to match the region.
3. Allocate and attach EIP to head node
4. Add workspace security group egress rule

Related-with: oap-project#1011
kiwik added a commit to kiwik/cloudtik that referenced this issue Mar 27, 2023
1. "ap-southeast-3" region is in Singapore, we can use more stable
and speed networking access to some resources.
2. Update server flavor and image to match the region.
3. Allocate and attach EIP to head node
4. Add workspace security group egress rule
5. Add workspace subnet DNS option

Related-with: oap-project#1011
kiwik added a commit to kiwik/cloudtik that referenced this issue Mar 27, 2023
1. "ap-southeast-3" region is in Singapore, we can use more stable
and speed networking access to some resources.
2. Update server flavor and image to match the region.
3. Allocate and attach EIP to head node
4. Add workspace security group egress rule
5. Add workspace subnet DNS option
6. Add configurable workspace bandwidth option for EIP and NAT

Related-with: oap-project#1011
jerrychenhf pushed a commit that referenced this issue Mar 27, 2023
1. Enable OBSClient security provider policy chain, it can work
with ENV "OBS_ACCESS_KEY_ID" and "OBS_SECRET_ACCESS_KEY" or
ECS agent to get AK/SK automatically, it's disable by default.
2. Fix cloud.storage.uri return a whole OBS bucket URI for
command "cloudtik workspace info"
3. Fix ECS create server error for command "cloudtik start",
remove unnecessary item in dict of create_server.

Related-with: #1011
kiwik added a commit to kiwik/cloudtik that referenced this issue Mar 28, 2023
1. "ap-southeast-3" region is in Singapore, we can use more stable
and speed networking access to some resources.
2. Update server flavor and image to match the region.
3. Allocate and attach EIP to head node
4. Add workspace security group egress rule
5. Add workspace subnet DNS option
6. Add configurable workspace bandwidth option for EIP and NAT
7. Add fs.obs.endpoint in core-site.xml for HuaweiCloud provider

Related-with: oap-project#1011
kiwik added a commit to kiwik/cloudtik that referenced this issue Mar 31, 2023
1. "ap-southeast-3" region is in Singapore, we can use more stable
and speed networking access to some resources.
2. Update server flavor and image to match the region.
3. Allocate and attach EIP to head node
4. Add workspace security group egress rule
5. Add workspace subnet DNS option
6. Add configurable workspace bandwidth option for EIP and NAT
7. Add fs.obs.endpoint in core-site.xml for HuaweiCloud provider

Related-with: oap-project#1011
kiwik added a commit to kiwik/cloudtik that referenced this issue Mar 31, 2023
jerrychenhf pushed a commit that referenced this issue Mar 31, 2023
1. "ap-southeast-3" region is in Singapore, we can use more stable
and speed networking access to some resources.
2. Update server flavor and image to match the region.
3. Allocate and attach EIP to head node
4. Add workspace security group egress rule
5. Add workspace subnet DNS option
6. Add configurable workspace bandwidth option for EIP and NAT
7. Add fs.obs.endpoint in core-site.xml for HuaweiCloud provider

Related-with: #1011
@jerrychenhf
Contributor

@kiwik One question about the recent change to the default region, image_ref and flavor_ref:
I noticed that when you changed the default region to ap-southeast-3, you also changed the image_ref and flavor_ref.

So what happens when a user sets the region in the configuration file to another region? Do they have to change the image_ref and flavor_ref, or do the current default image_ref and flavor_ref work for other regions too?

If the current default image_ref and flavor_ref work only for the current default region (ap-southeast-3), we would need some improvements for a better user experience:

  1. When the user switches regions, we can look up the right image_ref (we did something similar for AWS, because different regions have different image ids).
  2. Ideally, the flavor_ref should be available in all regions, so that switching regions doesn't require changing flavor_ref. I don't see any reason why instance type names should differ across regions.

Thanks,
Haifeng

@kiwik
Contributor Author

kiwik commented Mar 31, 2023

Understood, your point makes sense: the default values should work for most cases so that users don't have to change the configuration file.

Actually, for HuaweiCloud the flavor_ref names are the same across regions, but some flavors may be sold out in a region, or a region may offer only the latest-generation flavors. The ai1s.* flavors are more widely available than ai1.* in HuaweiCloud, so I changed it.

https://support.huaweicloud.com/productdesc-ecs/ecs_01_0047.html

@jerrychenhf
Contributor

jerrychenhf commented Mar 31, 2023

@kiwik So the improvement is only needed for image_ref.

Please refer to the _configure_ami function in the AWS bootstrap step, which configures the image id automatically.

The basic logic is: if the user specified an image in the configuration file, we use that. If the user doesn't specify one explicitly, we take two steps to get the image id. First, we try to use the API to list the image ids that satisfy our needs for that region, and use one if found. We also keep a static list of known image ids for major regions and use it as the last resort.

With this method, we don't need a default image value in the default.yaml file, so we can distinguish whether the user explicitly specified one or not.
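
The three-level fallback described above could be sketched like this. It is not the actual _configure_ami code: the function name, the `list_images` callable and the `known_images` mapping are hypothetical stand-ins for the real API lookup and the static per-region list.

```python
from typing import Callable, Dict, List


def resolve_image_ref(
    node_config: Dict,
    region: str,
    list_images: Callable[[str], List[str]],
    known_images: Dict[str, str],
) -> str:
    """Pick an image id: user value first, then API lookup, then static list."""
    # 1. An image the user set explicitly always wins.
    user_image = node_config.get("image_ref")
    if user_image:
        return user_image

    # 2. Ask the cloud API for suitable images in this region.
    candidates = list_images(region)
    if candidates:
        return candidates[0]

    # 3. Last resort: a static table of known image ids per region.
    fallback = known_images.get(region)
    if fallback is None:
        raise ValueError(f"No default image known for region {region!r}")
    return fallback
```

Because the lookup only runs when `image_ref` is absent, the defaults file must not set `image_ref`; otherwise the first branch would always be taken.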

@kiwik
Contributor Author

kiwik commented Apr 6, 2023

Please refer to the _configure_ami function in the AWS bootstrap step, which configures the image id automatically.

No problem, thank you for pointing to a reference example.

jerrychenhf pushed a commit that referenced this issue Apr 7, 2023
@kiwik
Contributor Author

kiwik commented Apr 7, 2023

Hi @jerrychenhf, I updated core-site.xml in b2b231b#diff-1a3d9265e27b4d4bbb41c8ae9f94e923f2ea4e3f77473d7d1c980527fb3b18fa . How can I trigger a docker image update to apply the changes?

kiwik added a commit to kiwik/cloudtik that referenced this issue Apr 7, 2023
1.Add "op_svc_userid" into head node metadata so that head node can
apply temp agancy AK/SK context to launche worker nodes with workspace
keypair.
2.Add default security ingress rule in workspace security group

Related-with: oap-project#1011
kiwik added a commit to kiwik/cloudtik that referenced this issue Apr 7, 2023
1.Add "op_svc_userid" into worker node metadata so that head node can
apply temp agancy AK/SK context to launche worker nodes with workspace
keypair.
2.Add default security ingress rule in workspace security group

Related-with: oap-project#1011
jerrychenhf pushed a commit that referenced this issue Apr 10, 2023
1.Add "op_svc_userid" into worker node metadata so that head node can
apply temp agancy AK/SK context to launche worker nodes with workspace
keypair.
2.Add default security ingress rule in workspace security group

Related-with: #1011
kiwik added a commit to kiwik/cloudtik that referenced this issue Apr 10, 2023
1. Remove the image_ref UUID in defaults.yaml in HuaweiCloud provider,
   try to get default image if user don't specify image_ref
2. Update HuaweiCloud Python SDK versions

Related-with: oap-project#1011
jerrychenhf pushed a commit that referenced this issue Apr 13, 2023
1. Remove the image_ref UUID in defaults.yaml in HuaweiCloud provider,
   try to get default image if user don't specify image_ref
2. Update HuaweiCloud Python SDK versions

Related-with: #1011