GKE: running a batch ML workload using Redis and Spot instances

We can set up a Kubernetes Job to run our machine learning workload, which uses Redis to manage job queues and saves checkpoints to Filestore.

First, we need to create our Filestore instance by running the following command:

gcloud filestore instances create batch-aiml-filestore \
    --zone=australia-southeast2-a \
    --tier=BASIC_HDD \
    --file-share=name="NFSVol",capacity=1TB \
    --network=name="default"

Next, we need the IP address of our Filestore instance so that we can substitute it into our Kubernetes manifest:

gcloud filestore instances list \
    --project=$PROJECT_ID \
    --zone=australia-southeast2-a
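
The list command prints the address in a table, so it has to be copied by hand. As a rough sketch, the address could instead be captured in a shell variable with `gcloud filestore instances describe` and a `--format` projection. The function name `filestore_ip` and the variable `FILESTORE_IP` are illustrative names that do not come from the samples repository, and the field path `networks[0].ipAddresses[0]` is an assumption based on the layout of the Filestore instance resource.

```shell
# Hypothetical helper: print the first IP address of a Filestore instance.
# Usage: filestore_ip INSTANCE_NAME ZONE
filestore_ip() {
  # The field path below is an assumption about the instance resource layout.
  gcloud filestore instances describe "$1" \
    --zone="$2" \
    --format="value(networks[0].ipAddresses[0])"
}

# Example (requires an authenticated gcloud):
# FILESTORE_IP="$(filestore_ip batch-aiml-filestore australia-southeast2-a)"
# echo "$FILESTORE_IP"
```

If the projection prints nothing, running the describe command without `--format` shows the full resource, and the path can be adjusted to match.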

Next, we clone the GCP Kubernetes samples repository and change into the sample's directory.

git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples
cd kubernetes-engine-samples/batch/aiml-workloads

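Now we replace the placeholder in the persistent volume manifest with the Filestore IP address (192.168.147.210 in this example):

sed -i "s/<FILESTORE_IP_ADDRESS>/192.168.147.210/g" \
  kubernetes-manifests/persistent-volume.yaml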
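
Hard-coding the address means editing the command for every new Filestore instance. A minimal sketch of a reusable version follows. The helper name `set_filestore_ip` is hypothetical and not part of the samples repository. It does a loose format check on the address and writes through a temporary file, because the `-i` flag behaves differently in GNU sed and BSD/macOS sed.

```shell
# Hypothetical helper: replace the <FILESTORE_IP_ADDRESS> placeholder in a manifest.
# Usage: set_filestore_ip IP MANIFEST_PATH
set_filestore_ip() {
  ip="$1"
  manifest="$2"

  # Loose sanity check: four dot-separated groups of 1-3 digits.
  if ! printf '%s\n' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
    echo "not an IPv4 address: $ip" >&2
    return 1
  fi

  # Substitute into a temporary file, then copy the result back so the
  # manifest keeps its original permissions.
  tmp="$(mktemp)" || return 1
  sed "s/<FILESTORE_IP_ADDRESS>/$ip/g" "$manifest" > "$tmp" \
    && cat "$tmp" > "$manifest"
  status=$?
  rm -f "$tmp"
  return "$status"
}

# Example:
# set_filestore_ip 192.168.147.210 kubernetes-manifests/persistent-volume.yaml
```

The check only catches obvious mistakes such as an empty variable. It does not verify that each group is in the 0-255 range.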

Then we can apply the following manifests:

kubectl apply -f kubernetes-manifests/persistent-volume.yaml

kubectl apply -f kubernetes-manifests/persistent-volume-claim.yaml

kubectl apply -f kubernetes-manifests/redis-pod.yaml

kubectl apply -f ./kubernetes-manifests/redis-service.yaml
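
The dataset transfer copies files into the redis-leader Pod, so that Pod has to be running before the copy starts. One possible way to wait for it is `kubectl wait`, sketched below. The helper name `wait_for_pod` and the 120s default timeout are assumptions chosen for illustration.

```shell
# Hypothetical helper: block until a Pod reports the Ready condition.
# Usage: wait_for_pod POD_NAME [TIMEOUT]
wait_for_pod() {
  # The timeout defaults to 120s when no second argument is given.
  kubectl wait --for=condition=Ready "pod/$1" --timeout="${2:-120s}"
}

# Example (requires a configured kubectl context):
# wait_for_pod redis-leader
```

`kubectl wait` exits with a non-zero status if the timeout expires, so the call can be chained with `&&` in front of later commands.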

We will now set up our training data by transferring it into the volume we created earlier.

sh scripts/transfer-datasets.sh

Under the hood, it runs the following commands:

echo "Copying datasets to Pod 'redis-leader'..."

kubectl cp datasets redis-leader:/mnt/fileserver

Next we will run the batch job. FYI, the workload doesn't do actual training.

kubectl apply -f ./kubernetes-manifests/workload.yaml

If you look at the logs, you will see similar output.
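
The logs can be read with `kubectl logs`. The sketch below wraps that call in a hypothetical helper, `job_logs`, which takes the Job name as an argument. The name `workload` in the example is an assumption about what workload.yaml defines, so it is worth confirming with `kubectl get jobs` first.

```shell
# Hypothetical helper: print the logs of a Pod created by a Job.
# Usage: job_logs JOB_NAME
job_logs() {
  kubectl logs "job/$1"
}

# Example (the Job name "workload" is an assumption; check `kubectl get jobs`):
# job_logs workload
```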

To view the files on the Filestore share, you can run the following command. There is no equivalent command for listing the files directly.

kubectl exec --stdin --tty redis-leader -- /bin/sh -c "ls -1 /mnt/fileserver/output"

mitzen