Kueue for machine learning jobs
Installing Kueue
We can install Kueue using the following commands:
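A minimal sketch of the install step, assuming the release-manifest URL pattern published by the upstream kubernetes-sigs/kueue project. The helper name `install_kueue` is hypothetical, and the release tag is left for you to supply because no version is given here.

```shell
# Hypothetical helper: install a given Kueue release into the current cluster.
# Usage: install_kueue <release-tag>
install_kueue() {
  local version="${1:?usage: install_kueue <release-tag>}"
  # Assumption: manifests.yaml is published as a release asset under this URL pattern.
  local url="https://github.com/kubernetes-sigs/kueue/releases/download/${version}/manifests.yaml"
  # Server-side apply is used because the CRDs in the manifest are large.
  kubectl apply --server-side -f "$url"
}
```

Call it with the tag you want to install, for example the latest tag listed on the project's releases page.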
You have to wait until the Kueue controller manager has fully initialized before you can create a ClusterQueue.
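Rather than polling by hand, the wait can be scripted. This is a sketch that assumes the default names from the upstream manifests, namely a `kueue-controller-manager` deployment in the `kueue-system` namespace; neither name is stated in this post, so adjust them if your install differs. The helper name `wait_for_kueue` is hypothetical.

```shell
# Hypothetical helper: block until the Kueue controller manager reports Available.
# Usage: wait_for_kueue [timeout]   (timeout defaults to 5m)
wait_for_kueue() {
  local timeout="${1:-5m}"
  # Assumption: default deployment and namespace names from the upstream manifests.
  kubectl wait deployment/kueue-controller-manager \
    --namespace kueue-system \
    --for=condition=Available \
    --timeout="$timeout"
}
```

The function returns a non-zero status if the deployment is not available within the timeout, so it can gate the queue creation that follows.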
These are the resources being created.
Next, we create the namespaces, deploy cluster-queue.yaml and local-queue.yaml from the sample repository, and submit the jobs.
git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples
cd kubernetes-engine-samples/batch/kueue-intro
kubectl create namespace team-a
kubectl create namespace team-b
kubectl apply -f cluster-queue.yaml
kubectl apply -f local-queue.yaml
./create_jobs.sh job-team-a.yaml job-team-b.yaml 10
After that, you can see the jobs running with:
kubectl get job -n team-a
kubectl get job -n team-b
To find the workload behind a job, describe the job:
kubectl describe job/job-name -n team-a
From that output, take the workload name and query it:
kubectl get workload/workload-id-from-job -n team-a
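The lookup above can also be done in one step. This is a sketch that assumes Kueue names the workload for a batch Job as `job-<job-name>-<hash>`; that naming is not stated in this post, so treat it as an assumption and fall back to `kubectl describe` if nothing matches. The helper name `workloads_for_job` is hypothetical.

```shell
# Hypothetical helper: list workloads whose name starts with "job-<job-name>-".
# Usage: workloads_for_job <job-name> [namespace]   (namespace defaults to team-a)
workloads_for_job() {
  local job="${1:?usage: workloads_for_job <job-name> [namespace]}"
  local ns="${2:-team-a}"
  # "-o name" prints one "<resource>/<name>" per line; the grep keeps matching names.
  # Note: a job name that is a prefix of another job's name will match both.
  kubectl get workloads --namespace "$ns" -o name | grep -- "/job-${job}-"
}
```

The output can be passed straight back to `kubectl get` or `kubectl describe` in the same namespace.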
If everything is working well, you should get the following output
Troubleshooting
If a job stays suspended and its workload is reported as inadmissible, try running:
kubectl describe ClusterQueue
Then look for the error message in the output. In my case it was "Can't admit new workloads: references missing ResourceFlavor(s): default-flavor".
So I needed to create the missing Kueue ResourceFlavor, default-flavor.
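A minimal sketch of that fix, assuming the `kueue.x-k8s.io/v1beta1` API version (check which version your installed release serves) and an empty flavor with no node labels or taints. The file name `default-flavor.yaml` and the helper `clusterqueue_status` are hypothetical; only the flavor name comes from the error message.

```shell
# Write a minimal ResourceFlavor manifest whose name matches the error message.
# Assumption: the cluster serves the v1beta1 Kueue API.
cat > default-flavor.yaml <<'EOF'
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
EOF

# Hypothetical helper: print each ClusterQueue name with the message of its
# "Active" status condition, to confirm the missing-flavor error is gone.
clusterqueue_status() {
  kubectl get clusterqueues -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Active")].message}{"\n"}{end}'
}
```

Apply the manifest with `kubectl apply -f default-flavor.yaml`, then run `clusterqueue_status` to check that the message no longer mentions a missing ResourceFlavor.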