GKE - deploying a batch ML training workload into your cluster

First, clone this repository. The batch/aiml-workload directory contains a complete machine learning example, from the training code to the Docker image to deployment on a GKE cluster.

git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples
Next, you need to create a GKE cluster and a Filestore instance.
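
The exact setup depends on your project, but a minimal sketch looks like this. It assumes the default VPC network, a single zone, and placeholder names (my-cluster, spot-pool, nfs-server) that you should replace; a Spot node pool is included because the workload Job below selects Spot nodes, and the file share is named NFSVol to match the PV manifest:

gcloud container clusters create my-cluster \
  --zone=us-central1-a --num-nodes=2

gcloud container node-pools create spot-pool \
  --cluster=my-cluster --zone=us-central1-a \
  --spot --num-nodes=2

gcloud filestore instances create nfs-server \
  --zone=us-central1-a --tier=BASIC_HDD \
  --file-share=name=NFSVol,capacity=1TB \
  --network=name=default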

If you list your Filestore instances, you will see the new instance's IP address:
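
gcloud filestore instances list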


Next, deploy a PersistentVolume (PV) that points at the Filestore share (replace the server IP below with your instance's IP address):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: fileserver
spec:
  capacity:
    storage: 1T
  accessModes:
  - ReadWriteMany
  nfs:
    path: /NFSVol
    server: 10.207.154.66
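
Assuming you saved the manifest as fileserver-pv.yaml (the filename here is just illustrative), apply it with:

kubectl apply -f fileserver-pv.yaml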


Next, set up your PersistentVolumeClaim (PVC). It binds to the PV above by name, and its claim name (fileserver-claim) is what the Pods below reference:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fileserver-claim
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: ""
  volumeName: fileserver
  resources:
    requests:
      storage: 1T
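
Apply the claim the same way (again, fileserver-pvc.yaml is an illustrative filename) and confirm it binds, i.e. both resources report a STATUS of Bound:

kubectl apply -f fileserver-pvc.yaml
kubectl get pv,pvc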

Then deploy your redis-pod.yaml, shown here:

apiVersion: v1
kind: Pod
metadata:
  name: redis-leader
  labels:
    app: redis
spec:
  containers:
    - name: leader
      image: redis
      env:
        - name: LEADER
          value: "true"
      volumeMounts:
        - mountPath: /mnt/fileserver
          name: redis-pvc
      ports:
        - containerPort: 6379
  volumes:
    - name: redis-pvc
      persistentVolumeClaim:
        claimName: fileserver-claim
        readOnly: false
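
Assuming the manifest lives at kubernetes-manifests/redis-pod.yaml like the other manifests in the repository, apply it and wait for the Pod to become ready:

kubectl apply -f ./kubernetes-manifests/redis-pod.yaml
kubectl wait pod -l app=redis --for=condition=Ready --timeout=300s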


Then transfer the training files to the Filestore share:

sh scripts/transfer-datasets.sh

which contains the following code:

# Copy files containing training datasets from code repository to the GKE Pod
echo "Copying datasets to Pod 'redis-leader'..."
kubectl cp datasets redis-leader:/mnt/fileserver
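
To confirm the copy succeeded, list the mounted share from inside the Pod:

kubectl exec redis-leader -- ls /mnt/fileserver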



Next, queue the datasets for processing by running:

sh scripts/queue-jobs.sh

which contains the following code:

echo "**************************************"
echo "Populating queue for batch training..."
echo "**************************************"
echo "The following datasets will be queued for processing:"
filenames=""

# Report all the files containing the training datasets
# and create a concatenated string of filenames to add to the Redis queue
for filepath in datasets/training/*.pkl; do
  echo "$filepath"
  filenames+=" $filepath"
done

# Push filenames to a Redis queue running on the `redis-leader` GKE Pod
QUEUE_LENGTH=$(kubectl exec redis-leader -- /bin/sh -c \
  "redis-cli rpush datasets ${filenames}")

echo "Queue length: ${QUEUE_LENGTH}"


Then deploy the Service for Redis:

kubectl apply -f ./kubernetes-manifests/redis-service.yaml

apiVersion: v1
kind: Service
metadata:
  name: redis
spec:
  ports:
    - port: 6379
      targetPort: 6379
  selector:
    app: redis
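
The Service gives the worker Pods a stable DNS name (redis) to connect to instead of a Pod IP. You can verify that it picked up the redis-leader Pod:

kubectl get service redis
kubectl get endpoints redis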


Next, we will deploy the workload, a Kubernetes Job that is designed to run on Spot VMs via the cloud.google.com/gke-spot node selector, as shown here:

kubectl apply -f ./kubernetes-manifests/workload.yaml


apiVersion: batch/v1
kind: Job
metadata:
  name: workload
spec:
  parallelism: 1
  template:
    metadata:
      name: workload
    spec:
      nodeSelector:
        cloud.google.com/gke-spot: "true"
      containers:
      - name: workload
        image: "us-docker.pkg.dev/google-samples/containers/gke/batch-ml-workload"
        volumeMounts:
        - mountPath: /mnt/fileserver
          name: workload-pvc
      volumes:
      - name: workload-pvc
        persistentVolumeClaim:
          claimName: fileserver-claim
          readOnly: false
      restartPolicy: OnFailure
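
You can watch the Job's Pod get scheduled onto a Spot node (the job-name label is added automatically by the Job controller) and wait for it to finish:

kubectl get pods -l job-name=workload --watch
kubectl wait job/workload --for=condition=complete --timeout=600s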


Once the Job has completed, you can check its output files:

kubectl exec --stdin --tty redis-leader -- /bin/sh -c "ls -1 /mnt/fileserver/output"
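
If you want the results locally, kubectl cp can pull them out of the Pod (the ./output destination is just an example):

kubectl cp redis-leader:/mnt/fileserver/output ./output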



To get an idea of what happens inside the workload, review the src folder in the repository. It contains the Python code that performs the ML training.

References:

https://cloud.google.com/kubernetes-engine/docs/tutorials/batch-ml-workload
