Kueue for machine learning jobs

 Installing Kueue 

We can install Kueue using the following commands:

export VERSION=v0.13.4

kubectl apply --server-side -f \
  https://github.com/kubernetes-sigs/kueue/releases/download/$VERSION/manifests.yaml

You have to wait until the Kueue controller manager has fully initialized before you can create a ClusterQueue.
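One way to do the waiting, as a minimal sketch: `kubectl wait` can block until the controller Deployment reports the Available condition. The function name `wait_for_kueue` is just a helper I made up for illustration. The Deployment name `kueue-controller-manager` and the `kueue-system` namespace are assumptions based on the default manifests, so adjust them if your install differs.

```shell
# Hypothetical helper: block until the Kueue controller Deployment is Available.
# Assumes the default install (Deployment kueue-controller-manager in kueue-system).
wait_for_kueue() {
  wait_timeout="${1:-5m}"   # optional first argument, defaults to 5 minutes
  kubectl wait deploy/kueue-controller-manager -n kueue-system \
    --for=condition=Available --timeout="$wait_timeout"
}

# Usage: wait_for_kueue        (or: wait_for_kueue 10m)
```

If the command times out, checking the pods in `kueue-system` is usually the next step.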

These are the resources being created.  



Next, we create the namespaces and deploy the ClusterQueue, LocalQueue, and ResourceFlavor manifests.

git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples

cd kubernetes-engine-samples/batch/kueue-intro

kubectl create namespace team-a

kubectl create namespace team-b

kubectl apply -f cluster-queue.yaml
kubectl apply -f local-queue.yaml
kubectl apply -f flavour.yaml


./create_jobs.sh job-team-a.yaml job-team-b.yaml 10


I think the resource requests in the sample jobs are quite high, so you might want to
lower them to save some cost.

After that, you can see the jobs running with:

kubectl get job -n team-a

kubectl get job -n team-b


To find the workload that belongs to a job, describe the job:

kubectl describe job/job-name -n team-a

From the output, take the workload name and look it up:

kubectl get workload/workload-id-from-job -n team-a

If everything is working well, you should get the following output 




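The two lookup steps above (describe the job, then get its workload) can also be scripted. This is a rough sketch, assuming Kueue labels each Workload with `kueue.x-k8s.io/job-uid` set to the UID of the owning Job. The helper name `workload_for_job` is hypothetical.

```shell
# Hypothetical helper: print the Workload(s) that Kueue created for a Job.
# Assumes Kueue labels each Workload with kueue.x-k8s.io/job-uid=<Job UID>.
workload_for_job() {
  job_name="$1"
  namespace="$2"
  # Look up the Job's UID first; give up if the Job does not exist.
  job_uid=$(kubectl get job "$job_name" -n "$namespace" \
    -o jsonpath='{.metadata.uid}') || return 1
  # Then select the Workloads that carry that UID as a label.
  kubectl get workloads -n "$namespace" \
    -l "kueue.x-k8s.io/job-uid=$job_uid" -o name
}

# Usage: workload_for_job job-name team-a
```

The result can be passed straight to `kubectl describe -n team-a` to see the admission status.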

Troubleshooting

The job stays suspended with the reason "inadmissible".

Try running:

kubectl describe ClusterQueue

Then look for the error message in the output. In my case it was "Can't admit new workloads: references missing ResourceFlavor(s): default-flavor".
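To pull out just that message instead of reading the whole describe output, something like the sketch below should work. It assumes the ClusterQueue is named `cluster-queue` (pass your own name otherwise) and that the message sits on the `Active` status condition. `cq_active_message` is a made-up helper name.

```shell
# Hypothetical helper: print the message of a ClusterQueue's Active condition.
# Assumes the ClusterQueue is named "cluster-queue" unless a name is passed in.
cq_active_message() {
  cq_name="${1:-cluster-queue}"
  kubectl get clusterqueue "$cq_name" \
    -o jsonpath='{.status.conditions[?(@.type=="Active")].message}'
}

# Usage: cq_active_message     (or: cq_active_message my-cluster-queue)
```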

So I needed to create the missing Kueue ResourceFlavor.
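A minimal sketch of what creating that flavor can look like. The assumptions here are the `kueue.x-k8s.io/v1beta1` API version and an empty flavor with no node labels or taints. The name has to match whatever the ClusterQueue error mentions (`default-flavor` in my case), and `resource_flavor_manifest` is a hypothetical helper name.

```shell
# Hypothetical helper: print a minimal ResourceFlavor manifest.
# Assumes the kueue.x-k8s.io/v1beta1 API and a flavor without node labels.
resource_flavor_manifest() {
  flavor_name="${1:-default-flavor}"   # must match the name in the error message
  cat <<EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: ${flavor_name}
EOF
}

# Usage: resource_flavor_manifest | kubectl apply -f -
```

Once the flavor exists, describing the ClusterQueue again should no longer show the missing ResourceFlavor message.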




