GKE running batch ML workload using Redis and spot instances

We can set up a Kubernetes Job to run our machine learning workload, which uses Redis to manage job queues and saves checkpoints to Filestore.

First we need to create our Filestore instance by running the following command:

gcloud filestore instances create batch-aiml-filestore \
    --zone=australia-southeast2-a \
    --tier=BASIC_HDD \
    --file-share=name="NFSVol",capacity=1TB \
    --network=name="default"


Next we need the IP address of our Filestore instance, because we will substitute it into our Kubernetes manifest later. We can find it by listing the instances:

gcloud filestore instances list \
    --project=$PROJECT_ID \
    --zone=australia-southeast2-a
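If we would rather capture the address in a variable than copy it from the listing, a small helper like the one below could work. This is only a sketch: the function name filestore_ip is made up for this post, and the field path networks.ipAddresses[0] is an assumption about the shape of the describe output, so check it against your own instance.

```shell
# Hypothetical helper: print the first IP address of a Filestore instance.
# $1 = instance name, $2 = zone
filestore_ip() {
  # The --format field path is an assumption about the describe output.
  gcloud filestore instances describe "$1" \
    --zone="$2" \
    --format="value(networks.ipAddresses[0])"
}

# Usage (needs gcloud and an existing instance):
# FILESTORE_IP="$(filestore_ip batch-aiml-filestore australia-southeast2-a)"
```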

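Next, clone the GCP Kubernetes Engine samples repository and substitute the Filestore IP address into the persistent volume manifest (replace 192.168.147.210 with your own address):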

git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples
cd kubernetes-engine-samples/batch/aiml-workloads

sed -i "\
  s/<FILESTORE_IP_ADDRESS>/192.168.147.210/g" \
  kubernetes-manifests/persistent-volume.yaml
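The sed command above relies on the in-place flag, which behaves differently between GNU and BSD sed. A more portable variant, sketched below, writes to a temporary file first and also rejects a value that does not look like an IPv4 address. The function name set_filestore_ip is hypothetical and not part of the samples repository.

```shell
# Hypothetical helper: replace the <FILESTORE_IP_ADDRESS> placeholder in a manifest.
# $1 = IP address, $2 = path to the manifest
set_filestore_ip() {
  ip="$1"
  manifest="$2"
  # Reject anything that is not shaped like a dotted-quad IPv4 address.
  if ! printf '%s\n' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
    echo "not an IPv4 address: $ip" >&2
    return 1
  fi
  # Write the result to a temporary file, then copy it back over the manifest,
  # so we do not depend on `sed -i` and keep the manifest's permissions.
  tmp="$(mktemp)" || return 1
  sed "s/<FILESTORE_IP_ADDRESS>/$ip/g" "$manifest" > "$tmp" &&
    cat "$tmp" > "$manifest"
  status=$?
  rm -f "$tmp"
  return "$status"
}

# Usage:
# set_filestore_ip 192.168.147.210 kubernetes-manifests/persistent-volume.yaml
```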

Then we can apply the following manifests:

kubectl apply -f kubernetes-manifests/persistent-volume.yaml
kubectl apply -f kubernetes-manifests/persistent-volume-claim.yaml
kubectl apply -f kubernetes-manifests/redis-pod.yaml
kubectl apply -f ./kubernetes-manifests/redis-service.yaml
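Before copying any data, it may help to wait until the Redis pod is actually ready, since kubectl cp needs a running container. A minimal sketch, assuming the pod is named redis-leader (the name the transfer script uses); the helper name wait_for_redis and the default timeout are made up for this post.

```shell
# Hypothetical helper: block until the redis-leader pod reports Ready.
# $1 = optional timeout (defaults to 120s, an arbitrary choice)
wait_for_redis() {
  kubectl wait --for=condition=Ready pod/redis-leader --timeout="${1:-120s}"
}

# Usage (needs kubectl pointed at the cluster):
# wait_for_redis 300s
```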


We will now set up our training data by transferring it into the volume we created earlier.

sh scripts/transfer-datasets.sh

Under the hood, it runs the following commands:

echo "Copying datasets to Pod 'redis-leader'..."
kubectl cp datasets redis-leader:/mnt/fileserver


Next we will run the batch job. FYI, the workload doesn't do actual training.

kubectl apply -f ./kubernetes-manifests/workload.yaml

If you look at the logs, you will see the output from the workload.
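One way to get at those logs is to wait for the Job to finish and then print them. The sketch below takes the Job name as an argument because it depends on what workload.yaml defines; the helper name job_logs and the default timeout are assumptions, not something from the samples repository.

```shell
# Hypothetical helper: wait for a Job to complete, then print its logs.
# $1 = Job name as defined in workload.yaml, $2 = optional timeout
job_logs() {
  job="$1"
  kubectl wait --for=condition=complete "job/$job" --timeout="${2:-600s}" || return 1
  # For a Job with several pods, kubectl picks one of them here.
  kubectl logs "job/$job"
}

# Usage (replace <JOB_NAME> with the name from `kubectl get jobs`):
# job_logs <JOB_NAME>
```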



To view files on the Filestore share, you can run the following command. There's no equivalent command to list the files on the Filestore directly.

kubectl exec --stdin --tty redis-leader -- /bin/sh -c "ls -1 /mnt/fileserver/output"




