* add ml example Signed-off-by: vsoch <[email protected]>
Showing 3 changed files with 109 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,72 @@ | ||
# DLIO ML Example | ||
|
||
This is an example of using the IO tool [DLIO](https://dlio-profiler.readthedocs.io/en/latest/build.html#build-dlio-profiler-with-pip-recommended), which can | ||
be installed on the fly with pip. | ||
|
||
## Usage | ||
|
||
Create a cluster and install JobSet to it. | ||
|
||
```bash | ||
kind create cluster | ||
VERSION=v0.2.0 | ||
kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/$VERSION/manifests.yaml | ||
``` | ||
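The JobSet controller takes a moment to start after its manifest is applied. A minimal sketch for blocking until it is available follows. It assumes the release manifest installs into the `jobset-system` namespace, which this README does not state, and the helper name `wait_for_jobset` is hypothetical.

```bash
# Hypothetical helper: block until every Deployment in the JobSet namespace
# reports the Available condition, or until the timeout expires.
# Assumption: JobSet is installed into the "jobset-system" namespace.
wait_for_jobset() {
  local namespace="${1:-jobset-system}"
  local timeout="${2:-120s}"
  kubectl wait --for=condition=Available deployment --all \
    -n "$namespace" --timeout="$timeout"
}
```

Calling `wait_for_jobset` between the `kubectl apply` above and the operator install avoids applying the operator before JobSet is ready.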
|
||
Install the operator (from the development manifest here): | ||
|
||
```bash | ||
kubectl apply -f ../../dist/metrics-operator-dev.yaml | ||
``` | ||
|
||
How to see metrics operator logs: | ||
|
||
```bash | ||
kubectl logs -n metrics-system metrics-controller-manager-859c66464c-7rpbw | ||
``` | ||
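The suffix of the controller pod name is generated, so it will differ on your cluster. As a sketch, the logs can be requested through the Deployment instead. The Deployment name `metrics-controller-manager` is an assumption inferred from the pod name above, and `operator_logs` is a hypothetical helper.

```bash
# Hypothetical helper: print operator logs without knowing the pod suffix.
# Assumption: the Deployment is named "metrics-controller-manager" and lives
# in the "metrics-system" namespace, as the pod name above suggests.
operator_logs() {
  local namespace="${1:-metrics-system}"
  local deployment="${2:-metrics-controller-manager}"
  kubectl logs -n "$namespace" "deployment/$deployment"
}
```

Running `operator_logs` with no arguments uses the defaults; pass a namespace and Deployment name if your install differs.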
|
||
Then create the metrics set. This runs a single DLIO benchmark (the `resnet50` workload) over MPI. | ||
|
||
```bash | ||
kubectl apply -f metrics.yaml | ||
``` | ||
|
||
Wait until the pods created by the job are running (the listing below shows two, a launcher and a worker): | ||
|
||
```bash | ||
kubectl get pods | ||
``` | ||
```console | ||
NAME READY STATUS RESTARTS AGE | ||
metricset-sample-l-0-0-lt782 1/1 Running 0 3s | ||
metricset-sample-w-0-0-4s5p9 1/1 Running 0 3s | ||
``` | ||
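Rather than re-running `kubectl get pods` by hand, the check can be scripted. The sketch below reads a `kubectl get pods --no-headers` listing on standard input and succeeds only when every pod whose name starts with a given prefix is `Running`. The function name `all_running` is hypothetical, and it assumes the default column layout shown above (name, ready, status).

```bash
# Hypothetical helper: succeed only if at least one pod name starts with the
# prefix and every such pod has STATUS "Running".
# Input: `kubectl get pods --no-headers` output on stdin.
all_running() {
  local prefix="$1"
  awk -v p="$prefix" '
    index($1, p) == 1 { n++; if ($3 != "Running") bad++ }
    END { if (n > 0 && bad == 0) exit 0; exit 1 }
  '
}
```

For example: `until kubectl get pods --no-headers | all_running metricset-sample; do sleep 2; done`.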
|
||
In the above, "l" is a launcher pod, and "w" is a worker pod. | ||
If you inspect the log for the launcher you'll see a short sleep (the network isn't up immediately) | ||
and then the benchmark running, with its log printed to the console. | ||
|
||
```bash | ||
kubectl logs metricset-sample-l-0-0-lt782 -f | ||
``` | ||
|
||
The job purposefully ends with `sleep infinity` to give you a chance to copy data out of the pod. | ||
|
||
```bash | ||
mkdir -p ./data ./output | ||
# Only if you are interested in the ML data | ||
kubectl cp metricset-sample-m-0-0-xfg6r:/dlio/data ./data/ | ||
kubectl cp metricset-sample-m-0-0-xfg6r:/dlio/output ./output | ||
``` | ||
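The pod name in the copy commands ends in a generated suffix, so it has to be looked up on each run. A minimal sketch of that lookup follows; `first_pod` is a hypothetical helper, and the `metricset-sample-m` prefix is taken from the commands above.

```bash
# Hypothetical helper: print the name of the first pod in the current
# namespace whose name starts with the given prefix.
first_pod() {
  local prefix="$1"
  kubectl get pods --no-headers -o custom-columns=NAME:.metadata.name |
    awk -v p="$prefix" 'index($1, p) == 1 { print $1; exit }'
}
```

With it, the copy becomes `pod="$(first_pod metricset-sample-m)"` followed by `kubectl cp "$pod:/dlio/output" ./output`.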
|
||
You can open the small trace file (`.trace*.pfw`, copied into `./output`) in [https://ui.perfetto.dev/](https://ui.perfetto.dev/). | ||
|
||
![img/trace.png](img/trace.png) | ||
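The trace files are dotfiles (the job decompresses `/dlio/output/.trace*.pfw.gz`), so a plain `ls ./output` will not show them. A small sketch for locating them after the copy, assuming `kubectl cp` kept the original file names; `list_traces` is a hypothetical helper:

```bash
# Hypothetical helper: list decompressed trace files, including hidden ones,
# under the given directory (default: ./output).
list_traces() {
  local dir="${1:-./output}"
  find "$dir" -type f -name '.trace*.pfw' | sort
}
```

Each path it prints can be loaded into the Perfetto UI.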
|
||
Other applications of interest might be related to AI/ML - we will try more soon! | ||
Clean up when you are done. | ||
|
||
```bash | ||
kubectl delete -f metrics.yaml | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,37 @@ | ||
apiVersion: flux-framework.org/v1alpha2 | ||
kind: MetricSet | ||
metadata: | ||
  labels: | ||
    app.kubernetes.io/name: metricset | ||
    app.kubernetes.io/instance: metricset-sample | ||
  name: metricset-sample | ||
spec: | ||
  # kubectl apply -f metrics.yaml | ||
  # kubectl logs <launcher-pod> -f | ||
  pods: 1 | ||
|
||
  metrics: | ||
    - name: io-ior | ||
      options: | ||
        command: mpirun --allow-run-as-root -np 10 dlio_benchmark workload=resnet50 ++workload.dataset.data_folder=/dlio/data ++workload.output.folder=/dlio/output | ||
        workdir: /dlio/data | ||
      addons: | ||
        - name: commands | ||
          options: | ||
            preBlock: | | ||
              apt-get update && apt-get install -y python3 python3-pip openmpi-bin openmpi-common libopenmpi-dev hwloc libhwloc-dev default-jre | ||
              #python3 -m pip install git+https://github.com/hariharan-devarajan/dlio-profiler.git | ||
              #python3 -m pip install git+https://github.com/argonne-lcf/dlio_benchmark.git | ||
              python3 -m pip install "dlio_benchmark[dlio_profiler] @ git+https://github.com/argonne-lcf/dlio_benchmark.git" | ||
              mkdir -p /dlio/data /dlio/output /dlio/logs | ||
              export DLIO_PROFILER_ENABLE=0 | ||
              mpirun -np 10 --allow-run-as-root dlio_benchmark workload=resnet50 ++workload.dataset.data_folder=/dlio/data ++workload.output.folder=/dlio/output ++workload.workflow.generate_data=True ++workload.workflow.train=False | ||
              export DLIO_PROFILER_LOG_LEVEL=ERROR | ||
              export DLIO_PROFILER_ENABLE=1 | ||
              export DLIO_PROFILER_INC_METADATA=1 | ||
              cd /dlio/data | ||
            postBlock: | | ||
              gzip -d /dlio/output/.trace*.pfw.gz | ||
              cat /dlio/output/.trace*.pfw | ||
              sleep infinity | ||