Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some feedback ref Chap 9 #4

Closed
Analect opened this issue Aug 22, 2023 · 3 comments
Closed

Some feedback ref Chap 9 #4

Analect opened this issue Aug 22, 2023 · 3 comments

Comments

@Analect
Copy link

Analect commented Aug 22, 2023

@terrytangyuan , as mentioned, for those looking to run the workflow.yaml end-to-end, there were a few infrastructure-related elements introduced in Chap 8 that might be worth recapping at the beginning of Chap 9, so as to ensure that someone focusing on that chapter has the possibility to run things. While you capture certain things in the sub README's on the repo, I have found that sometimes these aren't reflected back in the book .. when they probably should be.

  • cluster creation using k3d: per Chap 8 listing 8.21 k3d cluster create distml --image v1.25.3+k3s1 .... I think you need the 'rancher' part too for this to work k3d cluster create distml --image rancher/k3s:v1.25.3-k3s1.
  • creation of various kubeflow-related resource per 8.31 and 8.32. If I'm not mistaken, some of the service-accounts, roles and CRDs in distributed-ml-patterns/code/project/manifests/kubeflow-training are presumed to be in-place for Chap 9. I would maybe stick to using kubectl and not have expectations on readers to install other short-cut tooling like kns to work with their cluster.
kubectl create ns kubeflow
kubectl config set-context --current --namespace=kubeflow
kubectl kustomize manifests | kubectl apply -f -
  • in listing 9.13, it may be worth explicitly adding the 'kubeflow' namespace, as the workflow won't work otherwise, based on my own experimentation.
kubectl create -n kubeflow -f multi-worker-pvc.yaml
  • what exactly is happening in listing 9.20? ... maybe I'm missing something here, but a more explicit explanation in the context of understanding how this is being used for inferencing here would be great.
  • For the inference-service in 9.21, this gets replaced by an extended version in 9.28. Should the first one be explicitly deleted before the next one is created? In 9.37, the simpler version (9.21) without scaleTarget and scaleMetric is included as a manifest in the workflow. Is this intentional?
  • section 9.4.2 deals with step memoization ... I realise the dataset used for ingestion is a fixed fashion-MNIST and so it doesn't change. It would be great to see an example where that data is changed/augmented and have that trigger a training/model-selection re-run. Maybe the data ingestion can push to a Minio bucket and then Argo Events is used to detect changes to the bucket. That's probably beyond scope at this point.
  • while there is a reference to port-forwarding the argo-workflows service, I think it would benefit from a more explicit write-up of how to login. I found I ended up having to create my own user with the requisite cluster permissions to be able to get the UI working.
  • I saw some references in the code to an image for training-operator at public.ecr.aws/j1r0q0g6/training/training-operator, but also at https://github.com/orgs/kubeflow/packages/container/package/training%2Ftraining-operator. These are discussed here. It might be worth explicitly mentioning where the canonical source will be, going forward.

Overall, kudos for assembling this book and especially the end-to-end workflow in Chap 9. It offers a great blue-print for handling something 'real-world', which is rare to have in these sorts of text-books.

Best, Colum

@terrytangyuan
Copy link
Owner

cluster creation using k3d: per Chap 8 listing 8.21 k3d cluster create distml --image v1.25.3+k3s1 .... I think you need the 'rancher' part too for this to work k3d cluster create distml --image rancher/k3s:v1.25.3-k3s1.

This will be fixed in a new version.

what exactly is happening in listing 9.20? ... maybe I'm missing something here, but a more explicit explanation in the context of understanding how this is being used for inferencing here would be great.

It's a bit beyond the scope of this book so I tried to avoid too much details.

For the inference-service in 9.21, this gets replaced by an extended version in 9.28. Should the first one be explicitly deleted before the next one is created? In 9.37, the simpler version (9.21) without scaleTarget and scaleMetric is included as a manifest in the workflow. Is this intentional?

I only see it in Listing 9.28. I think it's fixed already.

I saw some references in the code to an image for training-operator at public.ecr.aws/j1r0q0g6/training/training-operator

Where do you see those references?

Thank you for your feedback! It might be too late to update the content but we'll try get smaller fixes in before publication.

@Analect
Copy link
Author

Analect commented Aug 22, 2023

Reference to training-operator is in distributed-ml-patterns/code/project/manifests/kubeflow-training/kustomization.yaml. Added a few typos to that Manning Live Book facility ... which perhaps have also been addressed since version 7.

image

Will close this out now. Hope the book does well!

@Analect Analect closed this as completed Aug 22, 2023
@terrytangyuan
Copy link
Owner

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants