Azure INFRA and WORKLOAD machine creation failed #148

Open
qiliRedHat opened this issue May 16, 2022 · 7 comments

qiliRedHat commented May 16, 2022

https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/e2e-benchmarking-multibranch-pipeline/job/cluster-workers-scaling/708/console

% oc get machineset -A
NAMESPACE               NAME                                               DESIRED   CURRENT   READY   AVAILABLE   AGE
openshift-machine-api   infra-northcentralus2                              1         1                             3h5m
openshift-machine-api   infra-northcentralus3                              1         1                             3h5m
openshift-machine-api   infra-qili-preserve-az0516-sr44j1                  1         1                             3h5m
openshift-machine-api   qili-preserve-az0516-sr44j-worker-northcentralus   3         3         3       3           3h55m
openshift-machine-api   workload-qili-preserve-az0516-sr44j                1         1                             3h5m
% oc get machines -A | grep infra
openshift-machine-api   infra-northcentralus2-82z2h                              Failed                                              3h8m
openshift-machine-api   infra-northcentralus3-2klrt                              Failed                                              3h8m
openshift-machine-api   infra-qili-preserve-az0516-sr44j1-lbhxq                  Failed                                              3h8m

Describing the machine shows that machine creation failed with "Please make sure that the referenced resource exists, and that both resources are in the same region":

  Error Message:           failed to reconcile machine "infra-northcentralus2-82z2h": network.InterfacesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="InvalidResourceReference" Message="Resource /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/qili-preserve-az0516-sr44j-rg/providers/Microsoft.Network/virtualNetworks/qili-preserve-az0516-sr44j-vnet/subnets/qili-preserve-az0516-sr44j-worker-subnet referenced by resource /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/qili-preserve-az0516-sr44j-rg/providers/Microsoft.Network/networkInterfaces/infra-northcentralus2-82z2h-nic was not found. Please make sure that the referenced resource exists, and that both resources are in the same region." Details=[]
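A minimal sketch for pulling the failed machines out of the listing above, so each one can be described in turn. The helper name `failed_machines` is hypothetical, and it assumes the default column layout of `oc get machines -A` (NAMESPACE, NAME, PHASE, ...).

```shell
# Print the names of machines whose PHASE column is "Failed".
# Reads `oc get machines -A` output on stdin (columns: NAMESPACE NAME PHASE ...).
failed_machines() {
  awk '$3 == "Failed" { print $2 }'
}

# Hypothetical usage against a live cluster (not run here):
#   oc get machines -A --no-headers | failed_machines
# The full error text for one machine can then be read with, for example:
#   oc describe machine <name> -n openshift-machine-api
```

Filtering on the PHASE column rather than grepping for a name prefix also catches failed machines outside the infra machinesets.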

Checking the infra machineset YAML shows that location is centralus.

% oc get machinesets/infra-northcentralus2 -n openshift-machine-api -o yaml
...
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: qili-preserve-az0516-sr44j
      machine.openshift.io/cluster-api-machineset: infra-northcentralus2
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: qili-preserve-az0516-sr44j
        machine.openshift.io/cluster-api-machine-role: infra
        machine.openshift.io/cluster-api-machine-type: infra
        machine.openshift.io/cluster-api-machineset: infra-northcentralus2
    spec:
      lifecycleHooks: {}
      metadata:
        labels:
          node-role.kubernetes.io/infra: ""
      providerSpec:
        value:
          apiVersion: azureproviderconfig.openshift.io/v1beta1
          credentialsSecret:
            name: azure-cloud-credentials
            namespace: openshift-machine-api
          image:
            offer: ""
            publisher: ""
            resourceID: /resourceGroups/qili-preserve-az0516-sr44j-rg/providers/Microsoft.Compute/images/qili-preserve-az0516-sr44j
            sku: ""
            version: ""
          kind: AzureMachineProviderSpec
          location: centralus
          managedIdentity: qili-preserve-az0516-sr44j-identity
          metadata:
            creationTimestamp: null
          osDisk:
            diskSettings: {}
            diskSizeGB: 128
            managedDisk:
              storageAccountType: Premium_LRS
            osType: Linux
          publicIP: false
          resourceGroup: qili-preserve-az0516-sr44j-rg
          subnet: qili-preserve-az0516-sr44j-worker-subnet
          userDataSecret:
            name: worker-user-data
          vmSize: Standard_D48s_v3
          vnet: qili-preserve-az0516-sr44j-vnet
          zone: "2"

Checking the code:

export AZURE_LOCATION=$(oc get machineset -n openshift-machine-api -o=go-template='{{(index .items 0).spec.template.spec.providerSpec.value.location}}')

But the worker node machineset is actually in 'northcentralus':

 % oc get machineset/qili-preserve-az0516-sr44j-worker-northcentralus -n openshift-machine-api -o yaml
...
spec:
  replicas: 3
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: qili-preserve-az0516-sr44j
      machine.openshift.io/cluster-api-machineset: qili-preserve-az0516-sr44j-worker-northcentralus
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: qili-preserve-az0516-sr44j
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: qili-preserve-az0516-sr44j-worker-northcentralus
    spec:
      lifecycleHooks: {}
      metadata: {}
      providerSpec:
        value:
          acceleratedNetworking: true
          apiVersion: machine.openshift.io/v1beta1
          credentialsSecret:
            name: azure-cloud-credentials
            namespace: openshift-machine-api
          image:
            offer: ""
            publisher: ""
            resourceID: /resourceGroups/qili-preserve-az0516-sr44j-rg/providers/Microsoft.Compute/images/qili-preserve-az0516-sr44j-gen2
            sku: ""
            version: ""
          kind: AzureMachineProviderSpec
          location: northcentralus
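
The mismatch between the two machinesets can be spotted without reading the full YAML. The following is a sketch only: `mismatched_locations` is a hypothetical helper, and it assumes the input is one `name<TAB>location` pair per line.

```shell
# Reads "name<TAB>location" lines on stdin and prints every machineset
# whose location differs from the expected location passed as $1.
mismatched_locations() {
  expected="$1"
  awk -F '\t' -v want="$expected" 'NF >= 2 && $2 != want { print $1 " (" $2 ")" }'
}

# Hypothetical usage against a live cluster (not run here):
#   oc get machineset -n openshift-machine-api \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.providerSpec.value.location}{"\n"}{end}' \
#     | mismatched_locations northcentralus
```

With the machinesets shown above, this would flag the infra machineset (centralus) while leaving the worker machineset (northcentralus) unreported.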

qiliRedHat commented May 16, 2022

Another typo that does not impact functionality:

name: infra-${CLUSTER_NAME}1

Change to infra-${CLUSTER_REGION}1 to be consistent with the names of the 2nd and 3rd infra machinesets.

name: workload-${CLUSTER_NAME}

Change to workload-${CLUSTER_REGION} to be consistent with the infra machinesets.
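
The naming suggested above can be sketched as follows. The helper names `infra_name` and `workload_name` are hypothetical and only illustrate the pattern; the real templates substitute `${CLUSTER_REGION}` directly.

```shell
# Build machineset names from the region, mirroring the proposed naming.
infra_name() {      # $1 = region, $2 = index (1..3)
  printf 'infra-%s%s\n' "$1" "$2"
}

workload_name() {   # $1 = region
  printf 'workload-%s\n' "$1"
}

# Example with an assumed region value.
CLUSTER_REGION=centralus
for i in 1 2 3; do
  infra_name "$CLUSTER_REGION" "$i"
done
workload_name "$CLUSTER_REGION"
```

Deriving every name from the region keeps the three infra machinesets and the workload machineset on one scheme.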

@qiliRedHat

Found the root cause: the location is hardcoded to centralus.
I will open a PR to fix it.

@qiliRedHat

The test failed again:

  Warning  FailedCreate  83s   azure-controller  InvalidConfiguration: failed to reconcile machine "infra-northcentralus1-6dpcb": failed to create vm infra-northcentralus1-6dpcb: failure sending request for machine infra-northcentralus1-6dpcb: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="InvalidAvailabilityZone" Message="The zone(s) '1' for resource 'Microsoft.Compute/virtualMachines/infra-northcentralus1-6dpcb' is not supported. The supported zones for location 'northcentralus' are ''"

Checking the worker machineset zone config:

% oc get machinesets/qili-preserve-az0516-sr44j-worker-northcentralus -n openshift-machine-api -o yaml | grep -i zone
          zone: ""

Fix this by changing zone to "".
Test with https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/qili-e2e-benchmark/job/cluster-post-config-fix-aure-location/5/
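
One way to avoid the InvalidAvailabilityZone error in regions without zones is to take the zone from what the worker machinesets already use. This is a sketch under that assumption, not the change made in the PR; `pick_zone` is a hypothetical helper.

```shell
# $1    = 1-based index of the infra machineset
# stdin = one worker zone per line (blank lines mean "no zone")
# Prints the zone to use, or an empty line when the workers have no zones.
pick_zone() {
  idx="$1"
  # Keep non-blank, unique zones in sorted order.
  zones=$(awk 'NF' | sort -u)
  if [ -z "$zones" ]; then
    printf '\n'
    return 0
  fi
  count=$(printf '%s\n' "$zones" | awk 'END { print NR }')
  # Wrap around when there are more infra machinesets than zones.
  n=$(( (idx - 1) % count + 1 ))
  printf '%s\n' "$zones" | sed -n "${n}p"
}

# Hypothetical usage against a live cluster (not run here); in practice the
# input should be limited to the worker machinesets:
#   oc get machineset -n openshift-machine-api \
#     -o jsonpath='{range .items[*]}{.spec.template.spec.providerSpec.value.zone}{"\n"}{end}' \
#     | pick_zone 1
```

For a cluster like the one above, where the worker machineset has `zone: ""`, the helper yields an empty zone; for workers in zones 1, 2 and 3 it hands out one zone per index.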


qiliRedHat commented May 16, 2022

The infra machines started to provision, but the workload machine failed because of a quota issue. This happened because I created the cluster for Reliability with no plan to scale up, so I didn't use the 'centralus' region and the default fell on 'northcentralus'. I realized this may be why @paigerube14 left 'centralus' as the default and used 1, 2, 3 for the zones. @paigerube14, your thoughts?
I'll create a new cluster in 'centralus' to test the fix.

  Warning  FailedCreate  110s  azure-controller  InvalidConfiguration: failed to reconcile machine "workload-northcentralus-qw9xx": failed to create vm workload-northcentralus-qw9xx: failure sending request for machine workload-northcentralus-qw9xx: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="OperationNotAllowed" Message="Operation could not be completed as it results in exceeding approved standardDSv3Family Cores quota. Additional details - Deployment Model: Resource Manager, Location: northcentralus, Current Limit: 400, Current Usage: 382, Additional Required: 32, (Minimum) New Limit Required: 414. Submit a request for Quota increase at https://aka.ms/ProdportalCRP/#blade/Microsoft_Azure_Capacity/UsageAndQuota.ReactView/Parameters/%7B%22subscriptionId%22:%2253b8f551-f0fc-4bea-8cba-6d1fefd54c8a%22,%22command%22:%22openQuotaApprovalBlade%22,%22quotas%22:[%7B%22location%22:%22northcentralus%22,%22providerId%22:%22Microsoft.Compute%22,%22resourceName%22:%22standardDSv3Family%22,%22quotaRequest%22:%7B%22properties%22:%7B%22limit%22:414,%22unit%22:%22Count%22,%22name%22:%7B%22value%22:%22standardDSv3Family%22%7D%7D%7D%7D]%7D by specifying parameters listed in the ‘Details’ section for deployment to succeed. Please read more about quota limits at https://docs.microsoft.com/en-us/azure/azure-supportability/per-vm-quota-requests"
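
The numbers in the quota error add up as follows. This is a small arithmetic sketch; `quota_check` is a hypothetical helper, and the values are copied from the message above.

```shell
# $1 = current limit, $2 = current usage, $3 = additional cores required
quota_check() {
  limit="$1"; usage="$2"; extra="$3"
  needed=$(( usage + extra ))
  if [ "$needed" -le "$limit" ]; then
    echo "ok: $needed of $limit cores"
  else
    echo "over quota: need limit >= $needed (short by $(( needed - limit )))"
  fi
}

# Values from the error: limit 400, usage 382, additional required 32.
quota_check 400 382 32
# prints: over quota: need limit >= 414 (short by 14)
```

The 414 matches the "(Minimum) New Limit Required" in the Azure message: the region is 14 cores short for one more 32-core machine.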

@qiliRedHat

I reverted the zone change and tested successfully with job https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/qili-e2e-benchmark/job/cluster-post-config-fix-aure-location/8/
The infra machines are now provisioned and spread across 3 zones.

 % oc get machines -A
NAMESPACE               NAME                                                  PHASE     TYPE               REGION      ZONE   AGE
openshift-machine-api   infra-centralus1-plt57                                Running   Standard_D48s_v3   centralus   1      6m2s
openshift-machine-api   infra-centralus2-bfqlt                                Running   Standard_D48s_v3   centralus   2      6m2s
openshift-machine-api   infra-centralus3-7jqnt                                Running   Standard_D48s_v3   centralus   3      6m2s
openshift-machine-api   qili-preserve-az-0516-8lmg7-master-0                  Running   Standard_D4s_v3    centralus   2      65m
openshift-machine-api   qili-preserve-az-0516-8lmg7-master-1                  Running   Standard_D4s_v3    centralus   1      65m
openshift-machine-api   qili-preserve-az-0516-8lmg7-master-2                  Running   Standard_D4s_v3    centralus   3      65m
openshift-machine-api   qili-preserve-az-0516-8lmg7-worker-centralus1-kzd8v   Running   Standard_D4s_v3    centralus   1      54m
openshift-machine-api   qili-preserve-az-0516-8lmg7-worker-centralus2-xvsmx   Running   Standard_D4s_v3    centralus   2      54m
openshift-machine-api   qili-preserve-az-0516-8lmg7-worker-centralus3-95ws4   Running   Standard_D4s_v3    centralus   3      54m
openshift-machine-api   workload-centralus-jkpwh                              Running   Standard_D32s_v3   centralus   1      5m53s
% oc get machinesets -A
NAMESPACE               NAME                                            DESIRED   CURRENT   READY   AVAILABLE   AGE
openshift-machine-api   infra-centralus1                                1         1         1       1           6m22s
openshift-machine-api   infra-centralus2                                1         1         1       1           6m22s
openshift-machine-api   infra-centralus3                                1         1         1       1           6m22s
openshift-machine-api   qili-preserve-az-0516-8lmg7-worker-centralus1   1         1         1       1           66m
openshift-machine-api   qili-preserve-az-0516-8lmg7-worker-centralus2   1         1         1       1           66m
openshift-machine-api   qili-preserve-az-0516-8lmg7-worker-centralus3   1         1         1       1           66m
openshift-machine-api   workload-centralus                              1         1         1       1           6m13s
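
The zone spread shown above can also be checked mechanically. This is a sketch that assumes the `oc get machines -A` column layout from the listing (ZONE is the sixth column, and every machine has a zone); `distinct_zones` is a hypothetical helper.

```shell
# stdin = `oc get machines -A` output; $1 = machine name prefix (e.g. "infra-")
# Prints the number of distinct zones used by Running machines with that prefix.
distinct_zones() {
  awk -v p="$1" '
    index($2, p) == 1 && $3 == "Running" { z[$6] = 1 }
    END { n = 0; for (k in z) n++; print n }
  '
}

# Hypothetical usage against a live cluster (not run here):
#   oc get machines -A --no-headers | distinct_zones infra-
```

For the listing above the result would be 3, one zone per infra machine.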

@qiliRedHat

@paigerube14 Sorry, I think this is not an issue; it's just that we only support location: centralus for now.

@paigerube14

I think it is good to be able to have automation that works for clusters that are both created from automation and not. The PR you opened definitely makes this more usable by all creation types. Getting up a cluster now to test your changes.
