This repository has been archived by the owner on Sep 19, 2022. It is now read-only.

Unable to spawn PyTorchJob due to alpine image dependency of pytorch-operator #319

Open
asahalyft opened this issue Feb 11, 2021 · 4 comments

asahalyft commented Feb 11, 2021

Hi team,
I am trying to use the PyTorch operator to spawn distributed PyTorch jobs. I see the image mentioned in

- name: gcr.io/kubeflow-images-public/pytorch-operator
to be 809251082950.dkr.ecr.us-west-2.amazonaws.com/pytorch-operator. However, that repository is not accessible from inside our network, so I switched to gcr.io/kubeflow-images-public/pytorch-operator:latest instead.
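
(For reference, one way to make that switch without hand-editing the generated output is a kustomize images override. This is only a minimal sketch; the manifests path is an assumption about the local layout, not necessarily how this repo is organized:)

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: pytorch-operator   # the namespace customization mentioned below
resources:
- manifests                   # hypothetical path to the operator base manifests
images:
- name: gcr.io/kubeflow-images-public/pytorch-operator
  newName: gcr.io/kubeflow-images-public/pytorch-operator   # keep the GCR image
  newTag: latest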

I cloned this pytorch-operator repo and deployed the operator with kustomize build manifests/ | kubectl apply -f -, which generates the following YAML (I also customized the namespace).

apiVersion: v1
kind: Namespace
metadata:
  labels:
    kustomize.component: pytorch-operator
  name: pytorch-operator
---
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  labels:
    kustomize.component: pytorch-operator
  name: pytorchjobs.kubeflow.org
spec:
  additionalPrinterColumns:
  - JSONPath: .status.conditions[-1:].type
    name: State
    type: string
  - JSONPath: .metadata.creationTimestamp
    name: Age
    type: date
  group: kubeflow.org
  names:
    kind: PyTorchJob
    plural: pytorchjobs
    singular: pytorchjob
  scope: Namespaced
  subresources:
    status: {}
  validation:
    openAPIV3Schema:
      properties:
        spec:
          properties:
            pytorchReplicaSpecs:
              properties:
                Master:
                  properties:
                    replicas:
                      maximum: 1
                      minimum: 1
                      type: integer
                Worker:
                  properties:
                    replicas:
                      minimum: 1
                      type: integer
  versions:
  - name: v1
    served: true
    storage: true
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app: pytorch-operator
    kustomize.component: pytorch-operator
  name: pytorch-operator
  namespace: pytorch-operator
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  labels:
    app: pytorch-operator
    kustomize.component: pytorch-operator
  name: pytorch-operator
rules:
- apiGroups:
  - kubeflow.org
  resources:
  - pytorchjobs
  - pytorchjobs/status
  - pytorchjobs/finalizers
  verbs:
  - '*'
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - pods
  - services
  - endpoints
  - events
  verbs:
  - '*'
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  labels:
    app: pytorch-operator
    kustomize.component: pytorch-operator
  name: pytorch-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: pytorch-operator
subjects:
- kind: ServiceAccount
  name: pytorch-operator
  namespace: pytorch-operator
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: "8443"
    prometheus.io/scrape: "true"
  labels:
    app: pytorch-operator
    kustomize.component: pytorch-operator
  name: pytorch-operator
  namespace: pytorch-operator
spec:
  ports:
  - name: monitoring-port
    port: 8443
    targetPort: 8443
  selector:
    kustomize.component: pytorch-operator
    name: pytorch-operator
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    kustomize.component: pytorch-operator
  name: pytorch-operator
  namespace: pytorch-operator
spec:
  replicas: 1
  selector:
    matchLabels:
      kustomize.component: pytorch-operator
      name: pytorch-operator
  template:
    metadata:
      labels:
        kustomize.component: pytorch-operator
        name: pytorch-operator
    spec:
      containers:
      - command:
        - /pytorch-operator.v1
        - --alsologtostderr
        - -v=1
        - --monitoring-port=8443
        env:
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        image: gcr.io/kubeflow-images-public/pytorch-operator:latest
        name: pytorch-operator
      serviceAccountName: pytorch-operator

I applied the above YAML and verified that the operator is running successfully:

$ kubectl get pods -n pytorch-operator
NAME                                READY   STATUS    RESTARTS   AGE
pytorch-operator-6746dbbc89-sv2qw   1/1     Running   0          100m
$ 
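
(For completeness, the operator logs can be checked the same way; a minimal sketch, assuming the deployment name and namespace from the manifest above:)

$ kubectl logs deployment/pytorch-operator -n pytorch-operator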

I then applied the following YAML to create a distributed PyTorchJob.

apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-nccl"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
          labels:
            lyft.com/ml-platform: ""  
        spec:
          containers:
            - name: pytorch
              image: "OUR_AWS_ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/lyftlearnhorovod:8678853078c35bf1d003761a070389ca535a5d03"
              command: 
                - python
              args: 
                - "/mnt/user-home/distributed-training-exploration/pytorchjob_distributed_mnist.py"
                - "--backend"
                - "nccl"
                - "--epochs"
                - "2"
              env:
              - name: NCCL_DEBUG
                value: "INFO"
              - name: NCCL_SOCKET_IFNAME
                value: "eth0"
              resources:
                limits:
                  nvidia.com/gpu: 1
              volumeMounts:
              - mountPath: /mnt/user-home
                name: nfs
          volumes:
          - name: nfs
            persistentVolumeClaim:
              claimName: asaha
          tolerations: 
            - key: lyft.net/gpu
              operator: Equal
              value: dedicated
              effect: NoSchedule
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
          labels:
            lyft.com/ml-platform: ""  
        spec:
          containers:
            - name: pytorch
              image: "OUR_AWS_ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/lyftlearnhorovod:8678853078c35bf1d003761a070389ca535a5d03"
              command: 
                - python
              args: 
                - "/mnt/user-home/distributed-training-exploration/pytorchjob_distributed_mnist.py"
                - "--backend"
                - "nccl"
                - "--epochs"
                - "2"
              env:
              - name: NCCL_DEBUG
                value: "INFO"
              - name: NCCL_SOCKET_IFNAME
                value: "eth0"
              resources:
                limits:
                  nvidia.com/gpu: 1
              volumeMounts:
              - mountPath: /mnt/user-home
                name: nfs
          volumes:
          - name: nfs
            persistentVolumeClaim:
              claimName: asaha
          tolerations: 
            - key: lyft.net/gpu
              operator: Equal
              value: dedicated
              effect: NoSchedule

I see the worker pods failing with ImagePullBackOff errors:
Failed to pull image "alpine:3.10": rpc error: code = Unknown desc = Error reading manifest 3.10 in OUR_AWS_ACCOUNT.dkr.ecr.us-west-2.amazonaws.com/alpine: name unknown: The repository with name 'alpine' does not exist in the registry with id 'OUR_AWS_ACCOUNT'

15m         Normal    BackOff                   pod/pytorch-dist-mnist-nccl-worker-0                   Back-off pulling image "alpine:3.10"
18m         Warning   Failed                    pod/pytorch-dist-mnist-nccl-worker-0                   Error: ImagePullBackOff
10s         Normal    Scheduled                 pod/pytorch-dist-mnist-nccl-worker-0                   Successfully assigned asaha/pytorch-dist-mnist-nccl-worker-0 to ip-10-44-108-79.ec2.internal
9s          Normal    Pulling                   pod/pytorch-dist-mnist-nccl-worker-0                   Pulling image "alpine:3.10"
8s          Warning   Failed                    pod/pytorch-dist-mnist-nccl-worker-0                   Failed to pull image "alpine:3.10": rpc error: code = Unknown desc = Error reading manifest 3.10 in <OUR_AWS_ACCOUNT>.dkr.ecr.us-west-2.amazonaws.com/alpine: name unknown: The repository with name 'alpine' does not exist in the registry with id '<OUR_AWS_ACCOUNT>'
8s          Warning   Failed                    pod/pytorch-dist-mnist-nccl-worker-0                   Error: ErrImagePull
7s          Normal    BackOff                   pod/pytorch-dist-mnist-nccl-worker-0                   Back-off pulling image "alpine:3.10"
20m         Normal    SuccessfulCreatePod       pytorchjob/pytorch-dist-mnist-nccl                     Created pod: pytorch-dist-mnist-nccl-master-0

Since the Docker images are fully materialized, why would it fail looking for alpine:3.10?
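
One way to see where that reference comes from is to inspect the worker pod spec directly (a sketch; the pod name and asaha namespace are taken from the events above):

$ kubectl get pod pytorch-dist-mnist-nccl-worker-0 -n asaha \
    -o jsonpath='{.spec.initContainers[*].image}'

If alpine:3.10 shows up there, it is being injected into the worker pod (presumably by the operator) rather than coming from this PyTorchJob spec.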

asahalyft commented Feb 18, 2021

Hi @gaocegege, is there a plan to look at or comment on this issue?

gaocegege (Member) commented:

Yeah, we are trying to use Amazon's new public Docker registry. Ref kubeflow/training-operator#1205


Jeffwan commented Mar 25, 2021

809251082950.dkr.ecr.us-west-2.amazonaws.com/pytorch-operator is used for internal testing. Once we move to a public registry, we will make the change. It has been changed to use GCR in master now.

asahalyft commented Jun 16, 2021

Hi @Jeffwan, I am still getting the alpine image-not-found error when we apply a PyTorchJob YAML, even with the Kubeflow 1.3.0 manifests.

Failed to pull image "alpine:3.10": rpc error: code = Unknown desc = Error reading manifest 3.10 in OUR_AWS_ACCOUNT.dkr.ecr.us-west-2.amazonaws.com/alpine: name unknown: The repository with name 'alpine' does not exist in the registry with id 'OUR_AWS_ACCOUNT'

I applied this PyTorchJob YAML. I also used the Kubeflow 1.3.0 manifests and kustomize to generate the pytorch-operator CRDs and operator YAMLs and applied them. The pytorch-operator logs show that the operator is running fine.

apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-nccl"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: gcr.io/kubeflow-images-public/pytorch-dist-mnist-test:latest
              args: ["--backend", "nccl"]
              env:
              - name: NCCL_DEBUG
                value: "INFO"
              - name: NCCL_SOCKET_IFNAME
                value: "eth0"
              resources: 
                limits:
                  nvidia.com/gpu: 1
              volumeMounts:
              - mountPath: /mnt/user-home
                name: nfs
          volumes:
          - name: nfs
            persistentVolumeClaim:
              claimName: asaha
          tolerations: 
            - key: lyft.net/gpu
              operator: Equal
              value: dedicated
              effect: NoSchedule
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers: 
            - name: pytorch
              image: gcr.io/kubeflow-images-public/pytorch-dist-mnist-test:latest
              args: ["--backend", "nccl"]
              env:
              - name: NCCL_DEBUG
                value: "INFO"
              - name: NCCL_SOCKET_IFNAME
                value: "eth0"
              resources: 
                limits:
                  nvidia.com/gpu: 1
              volumeMounts:
              - mountPath: /mnt/user-home
                name: nfs
          volumes:
          - name: nfs
            persistentVolumeClaim:
              claimName: asaha
          tolerations: 
            - key: lyft.net/gpu
              operator: Equal
              value: dedicated
              effect: NoSchedule
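
(For what it's worth, the image the deployed operator is actually running can be checked with something like this; adjust the namespace to wherever the operator got installed, kubeflow here is just an assumption:)

$ kubectl get deployment pytorch-operator -n kubeflow \
    -o jsonpath='{.spec.template.spec.containers[0].image}'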
