Some feedback ref Chap 9 #4

Analect · 2023-08-22T16:43:05Z

@terrytangyuan , as mentioned, for those looking to run the workflow.yaml end-to-end, there were a few infrastructure-related elements introduced in Chap 8 that might be worth recapping at the beginning of Chap 9, so as to ensure that someone focusing on that chapter has the possibility to run things. While you capture certain things in the sub README's on the repo, I have found that sometimes these aren't reflected back in the book .. when they probably should be.

cluster creation using k3d: per Chap 8 listing 8.21 k3d cluster create distml --image v1.25.3+k3s1 .... I think you need the 'rancher' part too for this to work k3d cluster create distml --image rancher/k3s:v1.25.3-k3s1.
creation of various kubeflow-related resource per 8.31 and 8.32. If I'm not mistaken, some of the service-accounts, roles and CRDs in distributed-ml-patterns/code/project/manifests/kubeflow-training are presumed to be in-place for Chap 9. I would maybe stick to using kubectl and not have expectations on readers to install other short-cut tooling like kns to work with their cluster.

kubectl create ns kubeflow
kubectl config set-context --current --namespace=kubeflow
kubectl kustomize manifests | kubectl apply -f -

in listing 9.13, it may be worth explicitly adding the 'kubeflow' namespace, as the workflow won't work otherwise, based on my own experimentation.

kubectl create -n kubeflow -f multi-worker-pvc.yaml

what exactly is happening in listing 9.20? ... maybe I'm missing something here, but a more explicit explanation in the context of understanding how this is being used for inferencing here would be great.
For the inference-service in 9.21, this gets replaced by an extended version in 9.28. Should the first one be explicitly deleted before the next one is created? In 9.37, the simpler version (9.21) without scaleTarget and scaleMetric is included as a manifest in the workflow. Is this intentional?
section 9.4.2 deals with step memoization ... I realise the dataset used for ingestion is a fixed fashion-MNIST and so it doesn't change. It would be great to see an example where that data is changed/augmented and have that trigger a training/model-selection re-run. Maybe the data ingestion can push to a Minio bucket and then Argo Events is used to detect changes to the bucket. That's probably beyond scope at this point.
while there is a reference to port-forwarding the argo-workflows service, I think it would benefit from a more explicit write-up of how to login. I found I ended up having to create my own user with the requisite cluster permissions to be able to get the UI working.
I saw some references in the code to an image for training-operator at public.ecr.aws/j1r0q0g6/training/training-operator, but also at https://github.com/orgs/kubeflow/packages/container/package/training%2Ftraining-operator. These are discussed here. It might be worth explicitly mentioning where the canonical source will be, going forward.

Overall, kudos for assembling this book and especially the end-to-end workflow in Chap 9. It offers a great blue-print for handling something 'real-world', which is rare to have in these sorts of text-books.

Best, Colum

The text was updated successfully, but these errors were encountered:

terrytangyuan · 2023-08-22T18:16:00Z

cluster creation using k3d: per Chap 8 listing 8.21 k3d cluster create distml --image v1.25.3+k3s1 .... I think you need the 'rancher' part too for this to work k3d cluster create distml --image rancher/k3s:v1.25.3-k3s1.

This will be fixed in a new version.

what exactly is happening in listing 9.20? ... maybe I'm missing something here, but a more explicit explanation in the context of understanding how this is being used for inferencing here would be great.

It's a bit beyond the scope of this book so I tried to avoid too much details.

For the inference-service in 9.21, this gets replaced by an extended version in 9.28. Should the first one be explicitly deleted before the next one is created? In 9.37, the simpler version (9.21) without scaleTarget and scaleMetric is included as a manifest in the workflow. Is this intentional?

I only see it in Listing 9.28. I think it's fixed already.

I saw some references in the code to an image for training-operator at public.ecr.aws/j1r0q0g6/training/training-operator

Where do you see those references?

Thank you for your feedback! It might be too late to update the content but we'll try get smaller fixes in before publication.

Analect · 2023-08-22T20:25:45Z

Reference to training-operator is in distributed-ml-patterns/code/project/manifests/kubeflow-training/kustomization.yaml. Added a few typos to that Manning Live Book facility ... which perhaps have also been addressed since version 7.

Will close this out now. Hope the book does well!

terrytangyuan · 2023-08-22T20:40:30Z

Thank you!

Analect closed this as completed Aug 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Some feedback ref Chap 9 #4

Some feedback ref Chap 9 #4

Analect commented Aug 22, 2023

terrytangyuan commented Aug 22, 2023

Analect commented Aug 22, 2023

terrytangyuan commented Aug 22, 2023

Some feedback ref Chap 9 #4

Some feedback ref Chap 9 #4

Comments

Analect commented Aug 22, 2023

terrytangyuan commented Aug 22, 2023

Analect commented Aug 22, 2023

terrytangyuan commented Aug 22, 2023