Train a model with GKE Autopilot

Google's example deploys the training and inference jobs as GKE workloads, but it requires a GPU. We can easily convert it to run on a normal CPU instead.

First we need to create:

1. a GKE Autopilot cluster

2. a Cloud Storage bucket named PROJECT_ID-gke-gpu-bucket
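Both can be created with gcloud; the cluster name and region below are placeholders, so substitute your own values:

```shell
# Create a GKE Autopilot cluster (name and region are placeholders)
gcloud container clusters create-auto mnist-cluster \
    --location=us-central1

# Create the Cloud Storage bucket (replace PROJECT_ID with your project ID)
gcloud storage buckets create gs://PROJECT_ID-gke-gpu-bucket \
    --location=us-central1
```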

Next, clone the repository:

git clone https://github.com/GoogleCloudPlatform/ai-on-gke && \
cd ai-on-gke/tutorials-and-examples/gpu-examples/training-single-gpu


Once your Autopilot cluster is ready, connect to it.
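Connecting fetches the cluster credentials for kubectl; the cluster name and region here assume the values used when the cluster was created:

```shell
gcloud container clusters get-credentials mnist-cluster \
    --location=us-central1
```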

Create the required namespace and a service account using the following commands:

kubectl create namespace gke-gpu-namespace

kubectl create serviceaccount gpu-k8s-sa --namespace=gke-gpu-namespace

Create an IAM policy binding to give the service account access to Cloud Storage:


gcloud storage buckets add-iam-policy-binding gs://PROJECT_ID-gke-gpu-bucket \
    --member=principal://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/PROJECT_ID.svc.id.goog/subject/ns/gke-gpu-namespace/sa/gpu-k8s-sa \
    --role=roles/storage.objectAdmin \
    --condition=None


Next, export the following configuration:

export K8S_SA_NAME=gpu-k8s-sa

export BUCKET_NAME=PROJECT_ID-gke-gpu-bucket


Let's start training and prediction with the MNIST dataset.

Copy the example code over to the bucket:

gsutil -m cp -R src/tensorflow-mnist-example gs://PROJECT_ID-gke-gpu-bucket/

Then let's deploy the training workload using this YAML. Notice that I removed the GPU requirements.


apiVersion: batch/v1
kind: Job
metadata:
  name: mnist-training-job
spec:
  template:
    metadata:
      name: mnist
      annotations:
        gke-gcsfuse/volumes: "true"
    spec:
      containers:
      - name: tensorflow
        image: tensorflow/tensorflow:latest
        command: ["/bin/bash", "-c", "--"]
        args: ["cd /data/tensorflow-mnist-example;
pip install -r requirements.txt; python tensorflow_mnist_train_distributed.py"]
        resources:
          limits:
            cpu: 1
            memory: 3Gi
        volumeMounts:
        - name: gcs-fuse-csi-vol
          mountPath: /data
          readOnly: false
      serviceAccountName: $K8S_SA_NAME
      volumes:
      - name: gcs-fuse-csi-vol
        csi:
          driver: gcsfuse.csi.storage.gke.io
          readOnly: false
          volumeAttributes:
            bucketName: $BUCKET_NAME
            mountOptions: "implicit-dirs"
      restartPolicy: "Never"
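Because the manifest references $K8S_SA_NAME and $BUCKET_NAME, one option is to substitute the exported variables with envsubst before applying it (the filename mnist-training-job.yaml is just an assumed name for the manifest above):

```shell
# Substitute the exported variables, then create the job in our namespace
envsubst < mnist-training-job.yaml | \
    kubectl apply --namespace=gke-gpu-namespace -f -
```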

Once the training has completed, you can view its output by running kubectl logs.
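For example, something like the following waits for the job to finish and then prints its logs (job and namespace names taken from the steps above):

```shell
kubectl wait --for=condition=complete job/mnist-training-job \
    --namespace=gke-gpu-namespace --timeout=600s
kubectl logs job/mnist-training-job --namespace=gke-gpu-namespace
```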


You can see that the model has been saved to Cloud Storage.
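One way to confirm is to list the bucket contents (the exact model path depends on what the training script writes):

```shell
gsutil ls -r gs://PROJECT_ID-gke-gpu-bucket/
```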


Next, we will deploy the inference workload to our cluster using the following YAML.

apiVersion: batch/v1
kind: Job
metadata:
  name: mnist-batch-prediction-job
spec:
  template:
    metadata:
      name: mnist
      annotations:
        gke-gcsfuse/volumes: "true"
    spec:
      containers:
      - name: tensorflow
        image: tensorflow/tensorflow:latest
        command: ["/bin/bash", "-c", "--"]
        args: ["cd /data/tensorflow-mnist-example; pip install -r requirements.txt;
python tensorflow_mnist_batch_predict.py"]
        resources:
          limits:
            cpu: 1
            memory: 3Gi
        volumeMounts:
        - name: gcs-fuse-csi-vol
          mountPath: /data
          readOnly: false
      serviceAccountName: $K8S_SA_NAME
      volumes:
      - name: gcs-fuse-csi-vol
        csi:
          driver: gcsfuse.csi.storage.gke.io
          readOnly: false
          volumeAttributes:
            bucketName: $BUCKET_NAME
            mountOptions: "implicit-dirs"
      restartPolicy: "Never"
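As with the training job, the $K8S_SA_NAME and $BUCKET_NAME variables can be substituted with envsubst before applying (the filename mnist-batch-prediction-job.yaml is just an assumed name for the manifest above):

```shell
envsubst < mnist-batch-prediction-job.yaml | \
    kubectl apply --namespace=gke-gpu-namespace -f -
```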





Once the prediction job completes, its output appears in the job logs.
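To fetch the prediction output (job name taken from the manifest above):

```shell
kubectl wait --for=condition=complete job/mnist-batch-prediction-job \
    --namespace=gke-gpu-namespace --timeout=600s
kubectl logs job/mnist-batch-prediction-job --namespace=gke-gpu-namespace
```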






