GKE - testing out a Mixtral-8x7B-Instruct-v0.1 deployment, unsuccessful on a non-GPU pool
Deploying this model requires a GPU. I tried to deploy without one and wasn't successful, mainly because huggingface-text-generation-inference does not ship a CPU-compatible image.
This is the required image:
image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311
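Since the failure here came down to scheduling on a non-GPU pool, it's worth confirming what your nodes actually expose before deploying. The nvidia.com/gpu resource (the same one commented out in the manifest below) only shows up on GPU nodes:

kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'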
First, you need to create a secret and make sure you've accepted the Mistral license agreement. If not, head over to Hugging Face and accept it.
Then create your secret:
kubectl create secret generic l4-demo \
--from-literal=HUGGING_FACE_TOKEN=hf_your-key-here
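To sanity-check that the secret landed, you can decode it back out (optional):

kubectl get secret l4-demo -o jsonpath='{.data.HUGGING_FACE_TOKEN}' | base64 --decode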
Then deploy Gradio and your model to the k8s cluster.
You can find additional detail here.
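Assuming you save the manifests below as gradio.yaml and llm.yaml (the filenames are mine, use whatever you like), the deploy itself is just:

kubectl apply -f gradio.yaml
kubectl apply -f llm.yaml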
The Gradio deployment looks like this:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gradio
  labels:
    app: gradio
spec:
  strategy:
    type: Recreate
  replicas: 1
  selector:
    matchLabels:
      app: gradio
  template:
    metadata:
      labels:
        app: gradio
    spec:
      containers:
      - name: gradio
        image: us-docker.pkg.dev/google-samples/containers/gke/gradio-app:v1.0.4
        resources:
          requests:
            cpu: "512m"
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "512Mi"
        env:
        - name: CONTEXT_PATH
          value: "/generate"
        - name: HOST
          value: "http://llm-service"
        - name: LLM_ENGINE
          value: "tgi"
        - name: MODEL_ID
          value: "mixtral-8x7b"
        - name: USER_PROMPT
          value: "[INST] prompt [/INST]"
        - name: SYSTEM_PROMPT
          value: "prompt"
        ports:
        - containerPort: 7860
---
apiVersion: v1
kind: Service
metadata:
  name: gradio-service
spec:
  type: LoadBalancer
  selector:
    app: gradio
  ports:
  - port: 80
    targetPort: 7860
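Once applied, the LoadBalancer service exposes the Gradio UI on port 80; watch for the external IP to be assigned with:

kubectl get service gradio-service --watch

Then open http://EXTERNAL_IP in a browser.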
The model deployment file looks like this:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm
  template:
    metadata:
      labels:
        app: llm
    spec:
      containers:
      - name: llm
        image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311
        resources:
          requests:
            cpu: "5"
            memory: "40Gi"
            #nvidia.com/gpu: "2"
          limits:
            cpu: "5"
            memory: "40Gi"
            #nvidia.com/gpu: "2"
        env:
        - name: MODEL_ID
          value: mistralai/Mixtral-8x7B-Instruct-v0.1
        - name: NUM_SHARD
          value: "2"
        - name: PORT
          value: "8080"
        - name: QUANTIZE
          value: bitsandbytes-nf4
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: l4-demo
              key: HUGGING_FACE_TOKEN
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        # mountPath is set to /tmp because that's where the HF_HOME environment
        # variable points in the TGI DLCs, instead of the default /data used by
        # the upstream TGI image, i.e. where the model downloaded from the Hub
        # will be stored
        - mountPath: /tmp
          name: ephemeral-volume
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      - name: ephemeral-volume
        ephemeral:
          volumeClaimTemplate:
            metadata:
              labels:
                type: ephemeral
            spec:
              accessModes: ["ReadWriteOnce"]
              storageClassName: "premium-rwo"
              resources:
                requests:
                  storage: 100Gi
      # nodeSelector:
      #   cloud.google.com/gke-accelerator: "nvidia-l4"
      #   cloud.google.com/gke-spot: "true"
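One thing worth calling out: the Gradio pod talks to http://llm-service, so the llm Deployment needs a matching Service, which isn't shown above. A minimal sketch, assuming TGI is listening on port 8080 as set via the PORT env var:

apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: llm
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080

With that in place you can also hit TGI directly from inside the cluster, on the same /generate path the Gradio CONTEXT_PATH points at:

curl http://llm-service/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "[INST] Hello [/INST]", "parameters": {"max_new_tokens": 64}}'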