Running a batch ML workload on GKE using Redis and Spot instances
We can set up a Kubernetes Job to run our machine learning workload, which uses Redis to manage the job queue and saves checkpoints to Filestore.
First, we need to create our Filestore instance by running the following command:
gcloud filestore instances create batch-aiml-filestore \
--zone=australia-southeast2-a \
--tier=BASIC_HDD \
--file-share=name="NFSVol",capacity=1TB \
--network=name="default"
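Provisioning can take a few minutes. As a quick check (a small aside, assuming the instance name and zone used above), you can describe the instance and confirm it reports a READY state:
gcloud filestore instances describe batch-aiml-filestore \
--zone=australia-southeast2-a \
--format="value(state)"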
Next, we need the IP address of the Filestore instance so that we can substitute it into the Kubernetes manifests:
gcloud filestore instances list \
--project=$PROJECT_ID \
--zone=australia-southeast2-a
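Instead of copying the address by hand, you can capture it in a shell variable. This is a small sketch; the networks[0].ipAddresses[0] field path is my assumption about the Filestore describe output:
# Grab the instance's first IP address (field path is an assumption)
FILESTORE_IP=$(gcloud filestore instances describe batch-aiml-filestore \
--zone=australia-southeast2-a \
--format="value(networks[0].ipAddresses[0])")
echo "$FILESTORE_IP"
You could then use ${FILESTORE_IP} in place of the literal address in the sed command below.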
Next, we clone the GCP Kubernetes Engine samples repository and replace the <FILESTORE_IP_ADDRESS> placeholder with the address returned above (192.168.147.210 in this example):
git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples
cd kubernetes-engine-samples/batch/aiml-workloads
sed -i "\
s/<FILESTORE_IP_ADDRESS>/192.168.147.210/g" \
kubernetes-manifests/persistent-volume.yaml
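For context, the placeholder lives in the PersistentVolume definition, which mounts the Filestore share over NFS. The sketch below is only an approximation of that manifest, not a copy of the sample file; the resource name and capacity are assumptions:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: fileserver          # assumed name for illustration
spec:
  capacity:
    storage: 1T
  accessModes:
    - ReadWriteMany
  nfs:
    server: <FILESTORE_IP_ADDRESS>   # replaced by the sed command above
    path: /NFSVol                    # the file share name from the Filestore instance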
Then we apply the following manifests:
kubectl apply -f kubernetes-manifests/persistent-volume.yaml
kubectl apply -f kubernetes-manifests/persistent-volume-claim.yaml
kubectl apply -f kubernetes-manifests/redis-pod.yaml
kubectl apply -f ./kubernetes-manifests/redis-service.yaml
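Before moving on, it is worth checking that the PersistentVolumeClaim has bound and that the Redis Pod is running:
kubectl get pv,pvc             # the claim should report STATUS Bound
kubectl get pod redis-leader   # the Pod should report STATUS Running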
Now we set up our training data by transferring the datasets into the volume we created earlier.
sh scripts/transfer-datasets.sh
Under the hood, this script runs the following command:
echo "Copying datasets to Pod 'redis-leader'..."
kubectl cp datasets redis-leader:/mnt/fileserver
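You can confirm the data landed on the share by listing the directory from inside the Pod; the datasets subdirectory name assumes kubectl cp created it under the mount point:
kubectl exec redis-leader -- ls /mnt/fileserver/datasets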
Next, we run the batch job. Note that this sample workload does not perform actual training.
kubectl apply -f ./kubernetes-manifests/workload.yaml
If you look at the job's logs, you will see the workload processing items from the Redis queue.
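To watch the job's progress, you can check the Job and its Pods and stream the logs. The <JOB_NAME> placeholder below is hypothetical; use whatever name workload.yaml defines:
kubectl get jobs
kubectl get pods
kubectl logs -f job/<JOB_NAME>   # <JOB_NAME> is the metadata.name in workload.yaml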
To view the output files on the Filestore share, run the command below. There is no separate command to list files on the volume directly, so we exec into the Redis Pod instead.
kubectl exec --stdin --tty redis-leader -- /bin/sh -c "ls -1 /mnt/fileserver/output"