Train a model with GKE Autopilot

Google's example deploys the training and inference jobs as GKE workloads, but it requires a GPU. We can easily convert it to run on a normal CPU instead.

First we need to create:

1. a GKE Autopilot cluster

2. a Cloud Storage bucket named PROJECT_ID-gke-gpu-bucket
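Both can be created with gcloud; the cluster name and region below are placeholders, so substitute your own values:

```shell
# Create a GKE Autopilot cluster (name and region are placeholders)
gcloud container clusters create-auto mnist-cluster \
    --location=us-central1

# Create the Cloud Storage bucket (replace PROJECT_ID with your project ID)
gcloud storage buckets create gs://PROJECT_ID-gke-gpu-bucket \
    --location=us-central1
```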

Next, clone the repository:

git clone https://github.com/GoogleCloudPlatform/ai-on-gke && \
cd ai-on-gke/tutorials-and-examples/gpu-examples/training-single-gpu


Once your Autopilot cluster is ready, connect to it.
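Connecting fetches the cluster credentials for kubectl; the cluster name and region here assume the values used when the cluster was created:

```shell
gcloud container clusters get-credentials mnist-cluster \
    --location=us-central1
```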

Create the required namespace and a service account using the following commands:

kubectl create namespace gke-gpu-namespace

kubectl create serviceaccount gpu-k8s-sa --namespace=gke-gpu-namespace

Create an IAM policy binding to give the service account access to Cloud Storage:


gcloud storage buckets add-iam-policy-binding gs://PROJECT_ID-gke-gpu-bucket \
    --member=principal://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/PROJECT_ID.svc.id.goog/subject/ns/gke-gpu-namespace/sa/gpu-k8s-sa \
    --role=roles/storage.objectAdmin \
    --condition=None


Next, export the following configuration:

export K8S_SA_NAME=gpu-k8s-sa

export BUCKET_NAME=PROJECT_ID-gke-gpu-bucket


Let's start training and prediction with the MNIST dataset.

Copy the example code over to the bucket:

gsutil -m cp -R src/tensorflow-mnist-example gs://PROJECT_ID-gke-gpu-bucket/

Then let's deploy the training workload using this YAML. Notice that I removed the GPU requirements.


apiVersion: batch/v1
kind: Job
metadata:
  name: mnist-training-job
spec:
  template:
    metadata:
      name: mnist
      annotations:
        gke-gcsfuse/volumes: "true"
    spec:
      containers:
      - name: tensorflow
        image: tensorflow/tensorflow:latest
        command: ["/bin/bash", "-c", "--"]
        args: ["cd /data/tensorflow-mnist-example;
pip install -r requirements.txt; python tensorflow_mnist_train_distributed.py"]
        resources:
          limits:
            cpu: 1
            memory: 3Gi
        volumeMounts:
        - name: gcs-fuse-csi-vol
          mountPath: /data
          readOnly: false
      serviceAccountName: $K8S_SA_NAME
      volumes:
      - name: gcs-fuse-csi-vol
        csi:
          driver: gcsfuse.csi.storage.gke.io
          readOnly: false
          volumeAttributes:
            bucketName: $BUCKET_NAME
            mountOptions: "implicit-dirs"
      restartPolicy: "Never"
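Because the manifest references $K8S_SA_NAME and $BUCKET_NAME, one option is to substitute the exported variables with envsubst before applying it (the filename mnist-training-job.yaml is just an assumed name for the manifest above):

```shell
# Substitute the exported variables, then create the job in our namespace
envsubst < mnist-training-job.yaml | \
    kubectl apply --namespace=gke-gpu-namespace -f -
```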

Once the training has completed, you can view its output by running kubectl logs.
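For example, something like the following waits for the job to finish and then prints its logs (job and namespace names taken from the steps above):

```shell
kubectl wait --for=condition=complete job/mnist-training-job \
    --namespace=gke-gpu-namespace --timeout=600s
kubectl logs job/mnist-training-job --namespace=gke-gpu-namespace
```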


You can see that the model has been saved to Cloud Storage.
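One way to confirm is to list the bucket contents (the exact model path depends on what the training script writes):

```shell
gsutil ls -r gs://PROJECT_ID-gke-gpu-bucket/
```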


Next, we will deploy the inference workload to our cluster using the following YAML.

apiVersion: batch/v1
kind: Job
metadata:
  name: mnist-batch-prediction-job
spec:
  template:
    metadata:
      name: mnist
      annotations:
        gke-gcsfuse/volumes: "true"
    spec:
      containers:
      - name: tensorflow
        image: tensorflow/tensorflow:latest
        command: ["/bin/bash", "-c", "--"]
        args: ["cd /data/tensorflow-mnist-example; pip install -r requirements.txt;
python tensorflow_mnist_batch_predict.py"]
        resources:
          limits:
            cpu: 1
            memory: 3Gi
        volumeMounts:
        - name: gcs-fuse-csi-vol
          mountPath: /data
          readOnly: false
      serviceAccountName: $K8S_SA_NAME
      volumes:
      - name: gcs-fuse-csi-vol
        csi:
          driver: gcsfuse.csi.storage.gke.io
          readOnly: false
          volumeAttributes:
            bucketName: $BUCKET_NAME
            mountOptions: "implicit-dirs"
      restartPolicy: "Never"
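As with the training job, the $K8S_SA_NAME and $BUCKET_NAME variables can be substituted with envsubst before applying (the filename mnist-batch-prediction-job.yaml is just an assumed name for the manifest above):

```shell
envsubst < mnist-batch-prediction-job.yaml | \
    kubectl apply --namespace=gke-gpu-namespace -f -
```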





Once the prediction job completes, its output appears in the job logs.
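To fetch the prediction output (job name taken from the manifest above):

```shell
kubectl wait --for=condition=complete job/mnist-batch-prediction-job \
    --namespace=gke-gpu-namespace --timeout=600s
kubectl logs job/mnist-batch-prediction-job --namespace=gke-gpu-namespace
```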






