GKE - deploying a batch ML training workload into your cluster

First, clone this repository. The batch/aiml-workload directory contains a complete machine learning example, from the training code to the Docker image to deployment on a GKE cluster.

git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples
Next, you need to create a GKE cluster and a Filestore instance.
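
The exact setup depends on your project, but a minimal sketch looks like this. It assumes the default VPC network, a single zone, and placeholder names (my-cluster, spot-pool, nfs-server) that you should replace; a Spot node pool is included because the workload Job below selects Spot nodes, and the file share is named NFSVol to match the PV manifest:

gcloud container clusters create my-cluster \
  --zone=us-central1-a --num-nodes=2

gcloud container node-pools create spot-pool \
  --cluster=my-cluster --zone=us-central1-a \
  --spot --num-nodes=2

gcloud filestore instances create nfs-server \
  --zone=us-central1-a --tier=BASIC_HDD \
  --file-share=name=NFSVol,capacity=1TB \
  --network=name=default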

If you list your Filestore instances, you will see the new instance's IP address:
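
gcloud filestore instances list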


Next, deploy a PersistentVolume (PV) that points at the Filestore share (replace the server IP below with your instance's IP address):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: fileserver
spec:
  capacity:
    storage: 1T
  accessModes:
  - ReadWriteMany
  nfs:
    path: /NFSVol
    server: 10.207.154.66
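
Assuming you saved the manifest as fileserver-pv.yaml (the filename here is just illustrative), apply it with:

kubectl apply -f fileserver-pv.yaml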


Next, set up your PersistentVolumeClaim (PVC). It binds to the PV above by name, and its claim name (fileserver-claim) is what the Pods below reference:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fileserver-claim
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: ""
  volumeName: fileserver
  resources:
    requests:
      storage: 1T
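
Apply the claim the same way (again, fileserver-pvc.yaml is an illustrative filename) and confirm it binds, i.e. both resources report a STATUS of Bound:

kubectl apply -f fileserver-pvc.yaml
kubectl get pv,pvc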

Then deploy your redis-pod.yaml, shown here:

apiVersion: v1
kind: Pod
metadata:
  name: redis-leader
  labels:
    app: redis
spec:
  containers:
    - name: leader
      image: redis
      env:
        - name: LEADER
          value: "true"
      volumeMounts:
        - mountPath: /mnt/fileserver
          name: redis-pvc
      ports:
        - containerPort: 6379
  volumes:
    - name: redis-pvc
      persistentVolumeClaim:
        claimName: fileserver-claim
        readOnly: false
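
Assuming the manifest lives at kubernetes-manifests/redis-pod.yaml like the other manifests in the repository, apply it and wait for the Pod to become ready:

kubectl apply -f ./kubernetes-manifests/redis-pod.yaml
kubectl wait pod -l app=redis --for=condition=Ready --timeout=300s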


Then transfer the training files to the Filestore share:

sh scripts/transfer-datasets.sh

which contains the following code:

# Copy files containing training datasets from code repository to the GKE Pod
echo "Copying datasets to Pod 'redis-leader'..."
kubectl cp datasets redis-leader:/mnt/fileserver
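
To confirm the copy succeeded, list the mounted share from inside the Pod:

kubectl exec redis-leader -- ls /mnt/fileserver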



Next, queue the datasets for processing by running:

sh scripts/queue-jobs.sh

which contains the following code:

echo "**************************************"
echo "Populating queue for batch training..."
echo "**************************************"
echo "The following datasets will be queued for processing:"
filenames=""

# Report all the files containing the training datasets
# and create a concatenated string of filenames to add to the Redis queue
for filepath in datasets/training/*.pkl; do
  echo "$filepath"
  filenames+=" $filepath"
done

# Push filenames to a Redis queue running on the `redis-leader` GKE Pod
QUEUE_LENGTH=$(kubectl exec redis-leader -- /bin/sh -c \
  "redis-cli rpush datasets ${filenames}")

echo "Queue length: ${QUEUE_LENGTH}"


Then deploy the Service for Redis:

kubectl apply -f ./kubernetes-manifests/redis-service.yaml

apiVersion: v1
kind: Service
metadata:
  name: redis
spec:
  ports:
    - port: 6379
      targetPort: 6379
  selector:
    app: redis
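
The Service gives the worker Pods a stable DNS name (redis) to connect to instead of a Pod IP. You can verify that it picked up the redis-leader Pod:

kubectl get service redis
kubectl get endpoints redis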


Next, we will deploy the workload, a Kubernetes Job that is designed to run on Spot VMs via the cloud.google.com/gke-spot node selector, as shown here:

kubectl apply -f ./kubernetes-manifests/workload.yaml


apiVersion: batch/v1
kind: Job
metadata:
  name: workload
spec:
  parallelism: 1
  template:
    metadata:
      name: workload
    spec:
      nodeSelector:
        cloud.google.com/gke-spot: "true"
      containers:
      - name: workload
        image: "us-docker.pkg.dev/google-samples/containers/gke/batch-ml-workload"
        volumeMounts:
        - mountPath: /mnt/fileserver
          name: workload-pvc
      volumes:
      - name: workload-pvc
        persistentVolumeClaim:
          claimName: fileserver-claim
          readOnly: false
      restartPolicy: OnFailure
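
You can watch the Job's Pod get scheduled onto a Spot node (the job-name label is added automatically by the Job controller) and wait for it to finish:

kubectl get pods -l job-name=workload --watch
kubectl wait job/workload --for=condition=complete --timeout=600s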


Once the Job has completed, you can check its output files:

kubectl exec --stdin --tty redis-leader -- /bin/sh -c "ls -1 /mnt/fileserver/output"
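
If you want the results locally, kubectl cp can pull them out of the Pod (the ./output destination is just an example):

kubectl cp redis-leader:/mnt/fileserver/output ./output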



To get an idea of what happens inside the workload, review the src folder in the repository. It contains the Python code that performs the ML training.

References:

https://cloud.google.com/kubernetes-engine/docs/tutorials/batch-ml-workload
