GKE - deploying a batch ML training workload to your cluster
First, clone this repository. The batch/aiml-workloads directory contains a complete machine learning example, from the training code to the Docker image and the deployment to a GKE cluster.
git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples
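Then change into the example directory so the relative paths used below resolve:

cd kubernetes-engine-samples/batch/aiml-workloads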
Next, create a GKE cluster and a Filestore instance.
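A minimal sketch of the two gcloud commands, assuming a zonal Standard cluster; the names, zone, tier, and sizes are placeholders to adjust:

# Create the GKE cluster (the Job later in this walkthrough expects
# Spot capacity, e.g. via autoscaling or a dedicated Spot node pool).
gcloud container clusters create batch-aiml \
    --zone us-central1-f

# Create a Filestore instance whose file share name (NFSVol) matches
# the path used in the PersistentVolume manifest below.
gcloud filestore instances create batch-aiml-filestore \
    --zone us-central1-f \
    --tier BASIC_HDD \
    --file-share name=NFSVol,capacity=1TB \
    --network name=default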
Listing your Filestore instances will show the new instance's IP address, which you will need for the PersistentVolume:
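gcloud filestore instances list

Use the IP address from the output as the server value in the manifest below.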
Create and deploy a PersistentVolume (PV) that points at the Filestore share:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: fileserver
spec:
  capacity:
    storage: 1T
  accessModes:
  - ReadWriteMany
  nfs:
    path: /NFSVol
    server: 10.207.154.66
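The Redis Pod and the workload below both mount this volume through a claim named fileserver-claim, so a matching PersistentVolumeClaim is needed as well. Here is a minimal sketch that binds statically to the fileserver PV above (the exact manifest in the repository may differ):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fileserver-claim
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: ""
  volumeName: fileserver
  resources:
    requests:
      storage: 1T

Apply both the PV and the PVC with kubectl apply -f before moving on.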
Then deploy the Redis Pod from redis-pod.yaml, shown here:
apiVersion: v1
kind: Pod
metadata:
  name: redis-leader
  labels:
    app: redis
spec:
  containers:
  - name: leader
    image: redis
    env:
    - name: LEADER
      value: "true"
    volumeMounts:
    - mountPath: /mnt/fileserver
      name: redis-pvc
    ports:
    - containerPort: 6379
  volumes:
  - name: redis-pvc
    persistentVolumeClaim:
      claimName: fileserver-claim
      readOnly: false
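Apply the manifest and wait for the Pod to become ready (the manifest path follows the repository's kubernetes-manifests layout):

kubectl apply -f ./kubernetes-manifests/redis-pod.yaml
kubectl get pods redis-leader --watch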
Then transfer the training files to the Filestore share:
sh scripts/transfer-datasets.sh
The script contains the following code:
# Copy files containing training datasets from code repository to the GKE Pod
echo "Copying datasets to Pod 'redis-leader'..."
kubectl cp datasets redis-leader:/mnt/fileserver
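To confirm the transfer, you can list the share from inside the Pod:

kubectl exec redis-leader -- ls -R /mnt/fileserver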
Next, push the dataset filenames onto the Redis queue:
sh scripts/queue-jobs.sh
The script contains the following code:
echo "**************************************"
echo "Populating queue for batch training..."
echo "**************************************"
echo "The following datasets will be queued for processing:"
filenames=""
# Report all the files containing the training datasets
# and create a concatenated string of filenames to add to the Redis queue
for filepath in datasets/training/*.pkl; do
  echo "$filepath"
  filenames+=" $filepath"
done
# Push filenames to a Redis queue running on the `redis-leader` GKE Pod
QUEUE_LENGTH=$(kubectl exec redis-leader -- /bin/sh -c \
"redis-cli rpush datasets ${filenames}")
echo "Queue length: ${QUEUE_LENGTH}"
Then deploy a Service for Redis so that worker Pods can reach it by the hostname redis:
kubectl apply -f ./kubernetes-manifests/redis-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: redis
spec:
  ports:
  - port: 6379
    targetPort: 6379
  selector:
    app: redis
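A quick way to confirm that the Service found the Redis Pod:

kubectl get endpoints redis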
Next, deploy the workload itself. It is a Kubernetes Job, and the cloud.google.com/gke-spot node selector schedules it onto Spot VMs, as shown here:
kubectl apply -f ./kubernetes-manifests/workload.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: workload
spec:
  parallelism: 1
  template:
    metadata:
      name: workload
    spec:
      nodeSelector:
        cloud.google.com/gke-spot: "true"
      containers:
      - name: workload
        image: "us-docker.pkg.dev/google-samples/containers/gke/batch-ml-workload"
        volumeMounts:
        - mountPath: /mnt/fileserver
          name: workload-pvc
      volumes:
      - name: workload-pvc
        persistentVolumeClaim:
          claimName: fileserver-claim
          readOnly: false
      restartPolicy: OnFailure
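While the Job drains the queue, you can watch its progress and follow its logs:

kubectl get jobs workload --watch
kubectl logs -f job/workload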
Once the Job completes, you can check its output files:
kubectl exec --stdin --tty redis-leader -- /bin/sh -c "ls -1 /mnt/fileserver/output"
To see what the workload actually does, review the src folder in the repository; it contains the Python code that performs the ML training.
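The gist of the worker pattern is easy to sketch: pop a dataset path from the Redis list, train on it, and write the result to the shared volume. The following is an illustrative sketch only; the model, key names, and path handling are assumptions, not the repository's actual code:

import os
import pickle

import redis
from sklearn.linear_model import LogisticRegression

SHARE = "/mnt/fileserver"

# "redis" resolves through the Kubernetes Service deployed earlier.
queue = redis.Redis(host="redis", port=6379)

os.makedirs(f"{SHARE}/output", exist_ok=True)

while True:
    # Pop the next dataset path; None means the queue is drained
    # and the Job can exit successfully.
    item = queue.lpop("datasets")
    if item is None:
        break

    # Queue entries were pushed as paths relative to the repository
    # root (e.g. datasets/training/x.pkl); assume they resolve under
    # the mounted share.
    filepath = os.path.join(SHARE, item.decode())

    # Load the pickled training data (assumed here to be (X, y)).
    with open(filepath, "rb") as f:
        X, y = pickle.load(f)

    # Train a simple stand-in model.
    model = LogisticRegression().fit(X, y)

    # Persist the trained model to the shared output directory,
    # which is what the ls check above inspects.
    out = os.path.join(SHARE, "output", os.path.basename(filepath) + ".model")
    with open(out, "wb") as f:
        pickle.dump(model, f)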
References:
https://cloud.google.com/kubernetes-engine/docs/tutorials/batch-ml-workload