gke - deploying multiple llm models with gradio interface

 

First, create a GKE cluster that is able to provision GPUs (required by the Hugging Face Text Generation Inference image).
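
A minimal sketch, assuming an Autopilot cluster (the cluster name and region are placeholders; Autopilot provisions the L4 GPU nodes on demand based on the nodeSelector in the deployment below):

gcloud container clusters create-auto l4-demo \
  --location=us-central1 \
  --release-channel=rapid

Once the cluster is provisioned, create a secret holding your Hugging Face token (replace hf_token with your own access token):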


kubectl create secret generic l4-demo \
  --from-literal=HUGGING_FACE_TOKEN=hf_token

Next, choose the model you want to deploy. In this example, we are going to use Falcon (tiiuae/falcon-40b-instruct).

Please refer to this page here for other models.


apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm
  template:
    metadata:
      labels:
        app: llm
    spec:
      containers:
      - name: llm
        image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.1-4.ubuntu2204.py310
        resources:
          requests:
            cpu: "10"
            memory: "60Gi"
            nvidia.com/gpu: "2"
          limits:
            cpu: "10"
            memory: "60Gi"
            nvidia.com/gpu: "2"
        env:
        - name: MODEL_ID
          value: tiiuae/falcon-40b-instruct
        - name: NUM_SHARD
          value: "2"
        - name: PORT
          value: "8080"
        - name: QUANTIZE
          value: bitsandbytes-nf4
        volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          # mountPath is set to /data as it's the path where the HUGGINGFACE_HUB_CACHE environment
          # variable points to in the TGI container image i.e. where the downloaded model from the Hub will be
          # stored
          - mountPath: /data
            name: ephemeral-volume
      volumes:
        - name: dshm
          emptyDir:
              medium: Memory
        - name: ephemeral-volume
          ephemeral:
            volumeClaimTemplate:
              metadata:
                labels:
                  type: ephemeral
              spec:
                accessModes: ["ReadWriteOnce"]
                storageClassName: "premium-rwo"
                resources:
                  requests:
                    storage: 175Gi
      nodeSelector:
        cloud.google.com/gke-accelerator: "nvidia-l4"
        cloud.google.com/gke-spot: "true"


Save the manifest above as text-generation-inference.yaml and apply it:

kubectl apply -f text-generation-inference.yaml

Ensure the pod is running.
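
You can watch the pod come up and follow the TGI logs; the first start takes a while because the model weights are downloaded from the Hub into the /data volume:

kubectl get pods -l app=llm
kubectl logs -f -l app=llm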

Next, create a ClusterIP Service for your workload.


apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: llm
  type: ClusterIP
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080

Save the Service manifest as llm-service.yaml and apply it:

kubectl apply -f llm-service.yaml 
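
Optionally, sanity-check the endpoint from your workstation by port-forwarding the Service and hitting TGI's /info route, which reports the loaded model (a quick check only; the Gradio frontend below talks to the Service directly inside the cluster):

kubectl port-forward service/llm-service 8080:80

Then, in a second terminal:

curl http://localhost:8080/info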


Deploying the Gradio chat service

Gradio is a web application framework that lets you build a chatbot with an LLM as the backend. The idea is to have a web interface to interact with your LLM instead of integrating with it through curl commands.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gradio
  labels:
    app: gradio
spec:
  strategy:
    type: Recreate
  replicas: 1
  selector:
    matchLabels:
      app: gradio
  template:
    metadata:
      labels:
        app: gradio
    spec:
      containers:
      - name: gradio
        image: us-docker.pkg.dev/google-samples/containers/gke/gradio-app:v1.0.4
        resources:
          requests:
            cpu: "512m"
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "512Mi"
        env:
        - name: CONTEXT_PATH
          value: "/generate"
        - name: HOST
          value: "http://llm-service"
        - name: LLM_ENGINE
          value: "tgi"
        - name: MODEL_ID
          value: "falcon-40b-instruct"
        - name: USER_PROMPT
          value: "User: prompt"
        - name: SYSTEM_PROMPT
          value: "Assistant: prompt"
        ports:
        - containerPort: 7860
---
apiVersion: v1
kind: Service
metadata:
  name: gradio-service
spec:
  type: LoadBalancer
  selector:
    app: gradio
  ports:
  - port: 80
    targetPort: 7860
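
Save this manifest (for example as gradio.yaml; the filename is just a placeholder) and apply it, then wait for the LoadBalancer to get an external IP and open that IP in your browser:

kubectl apply -f gradio.yaml
kubectl get service gradio-service --watch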

We mentioned that Gradio is able to talk to different LLMs. How does all this magic happen?

This Gradio implementation sends REST-based HTTP POST requests to the LLM endpoint above (llm-service), which is what makes it able to support Llama, Falcon and Mistral alike.

A simple example of such a request would look something like this.
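
A sketch only, reusing the port-forward from earlier; the prompt text and parameters are placeholders, and the exact payload the sample Gradio app builds from USER_PROMPT and SYSTEM_PROMPT may differ:

curl http://localhost:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "User: What is Kubernetes?\nAssistant:", "parameters": {"max_new_tokens": 200}}'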

You can have a look at the code here.


