GKE - testing out a Mixtral-8x7B-Instruct-v0.1 deployment, unsuccessful on a non-GPU pool
Deploying this model requires a GPU. I tried to deploy without one and wasn't successful, mainly because huggingface-text-generation-inference does not ship a CPU-compatible image.
This is the required image:
image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311
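Since the failure here came down to scheduling on a non-GPU pool, it's worth confirming what your nodes actually expose before deploying. The nvidia.com/gpu resource (the same one commented out in the manifest below) only shows up on GPU nodes:

kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'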
First, you need to create a secret and make sure you've accepted the Mistral license agreement. If not, head over to Hugging Face and accept it.
Then create your secret:
kubectl create secret generic l4-demo \
--from-literal=HUGGING_FACE_TOKEN=hf_your-key-here
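To sanity-check that the secret landed, you can decode it back out (optional):

kubectl get secret l4-demo -o jsonpath='{.data.HUGGING_FACE_TOKEN}' | base64 --decode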
Then deploy Gradio and your model to the k8s cluster.
You can find additional detail here.
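Assuming you save the manifests below as gradio.yaml and llm.yaml (the filenames are mine, use whatever you like), the deploy itself is just:

kubectl apply -f gradio.yaml
kubectl apply -f llm.yaml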
The Gradio deployment looks like this:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gradio
  labels:
    app: gradio
spec:
  strategy:
    type: Recreate
  replicas: 1
  selector:
    matchLabels:
      app: gradio
  template:
    metadata:
      labels:
        app: gradio
    spec:
      containers:
      - name: gradio
        image: us-docker.pkg.dev/google-samples/containers/gke/gradio-app:v1.0.4
        resources:
          requests:
            cpu: "512m"
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "512Mi"
        env:
        - name: CONTEXT_PATH
          value: "/generate"
        - name: HOST
          value: "http://llm-service"
        - name: LLM_ENGINE
          value: "tgi"
        - name: MODEL_ID
          value: "mixtral-8x7b"
        - name: USER_PROMPT
          value: "[INST] prompt [/INST]"
        - name: SYSTEM_PROMPT
          value: "prompt"
        ports:
        - containerPort: 7860
---
apiVersion: v1
kind: Service
metadata:
  name: gradio-service
spec:
  type: LoadBalancer
  selector:
    app: gradio
  ports:
  - port: 80
    targetPort: 7860
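Once applied, the LoadBalancer service exposes the Gradio UI on port 80; watch for the external IP to be assigned with:

kubectl get service gradio-service --watch

Then open http://EXTERNAL_IP in a browser.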
The model deployment file looks like this:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm
  template:
    metadata:
      labels:
        app: llm
    spec:
      containers:
      - name: llm
        image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311
        resources:
          requests:
            cpu: "5"
            memory: "40Gi"
            #nvidia.com/gpu: "2"
          limits:
            cpu: "5"
            memory: "40Gi"
            #nvidia.com/gpu: "2"
        env:
        - name: MODEL_ID
          value: mistralai/Mixtral-8x7B-Instruct-v0.1
        - name: NUM_SHARD
          value: "2"
        - name: PORT
          value: "8080"
        - name: QUANTIZE
          value: bitsandbytes-nf4
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: l4-demo
              key: HUGGING_FACE_TOKEN
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        # mountPath is set to /tmp because that's where the HF_HOME environment
        # variable points in the TGI DLCs, instead of the default /data used by
        # the upstream TGI image, i.e. where the model downloaded from the Hub
        # will be stored
        - mountPath: /tmp
          name: ephemeral-volume
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      - name: ephemeral-volume
        ephemeral:
          volumeClaimTemplate:
            metadata:
              labels:
                type: ephemeral
            spec:
              accessModes: ["ReadWriteOnce"]
              storageClassName: "premium-rwo"
              resources:
                requests:
                  storage: 100Gi
      # nodeSelector:
      #   cloud.google.com/gke-accelerator: "nvidia-l4"
      #   cloud.google.com/gke-spot: "true"
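One thing worth calling out: the Gradio pod talks to http://llm-service, so the llm Deployment needs a matching Service, which isn't shown above. A minimal sketch, assuming TGI is listening on port 8080 as set via the PORT env var:

apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: llm
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080

With that in place you can also hit TGI directly from inside the cluster, on the same /generate path the Gradio CONTEXT_PATH points at:

curl http://llm-service/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "[INST] Hello [/INST]", "parameters": {"max_new_tokens": 64}}'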