First, create a GKE cluster that can provision GPUs, which are required by the Hugging Face Text Generation Inference (TGI) server.
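As a minimal sketch, an Autopilot cluster works well here, because the nodeSelector in the deployment further down is enough for GKE to provision matching L4 spot nodes. The cluster name, region and release channel below are placeholders, and a Standard cluster with an L4 node pool is an equally valid choice:
gcloud container clusters create-auto l4-demo \
--region=us-central1 \
--release-channel=rapid
Once the cluster is provisioned, create a secret that holds your Hugging Face token: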
kubectl create secret generic l4-demo \
--from-literal=HUGGING_FACE_TOKEN=hf_token
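You can confirm the secret was created with:
kubectl get secret l4-demo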
Next, choose the model of your choice for deployment. In this example, we are going to use Falcon.
Please refer to this page here for other models.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm
  template:
    metadata:
      labels:
        app: llm
    spec:
      containers:
      - name: llm
        image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.1-4.ubuntu2204.py310
        resources:
          requests:
            cpu: "10"
            memory: "60Gi"
            nvidia.com/gpu: "2"
          limits:
            cpu: "10"
            memory: "60Gi"
            nvidia.com/gpu: "2"
        env:
        - name: MODEL_ID
          value: tiiuae/falcon-40b-instruct
        - name: NUM_SHARD
          value: "2"
        - name: PORT
          value: "8080"
        - name: QUANTIZE
          value: bitsandbytes-nf4
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        # mountPath is set to /data as it's the path where the HUGGINGFACE_HUB_CACHE environment
        # variable points to in the TGI container image, i.e. where the model downloaded from the Hub
        # will be stored
        - mountPath: /data
          name: ephemeral-volume
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      - name: ephemeral-volume
        ephemeral:
          volumeClaimTemplate:
            metadata:
              labels:
                type: ephemeral
            spec:
              accessModes: ["ReadWriteOnce"]
              storageClassName: "premium-rwo"
              resources:
                requests:
                  storage: 175Gi
      nodeSelector:
        cloud.google.com/gke-accelerator: "nvidia-l4"
        cloud.google.com/gke-spot: "true"
Apply the deployment:
kubectl apply -f text-generation-inference.yaml
Ensure the pod is running; one way to keep an eye on it is shown below.
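Downloading the Falcon-40B weights can take a while, so it helps to watch the pod status and follow the TGI logs (the label comes from the manifest above):
kubectl get pods -w
kubectl logs -f -l app=llm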
Next, create a ClusterIP Service for your workload.
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: llm
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
Apply the Service:
kubectl apply -f llm-service.yaml
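At this point you can sanity-check the model endpoint directly from your machine. This is a hedged example that relies on TGI's /generate API and a local port-forward; the prompt and parameters are arbitrary:
kubectl port-forward service/llm-service 8080:80
Then, in another terminal:
curl -s http://localhost:8080/generate \
-X POST \
-H "Content-Type: application/json" \
-d '{"inputs": "What is Kubernetes?", "parameters": {"max_new_tokens": 100}}'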
Deploying the Gradio chat service
Gradio is a web application that lets you build a chatbot with an LLM as the backend. The idea is to interact with your LLM through a web interface instead of curl commands.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gradio
  labels:
    app: gradio
spec:
  strategy:
    type: Recreate
  replicas: 1
  selector:
    matchLabels:
      app: gradio
  template:
    metadata:
      labels:
        app: gradio
    spec:
      containers:
      - name: gradio
        image: us-docker.pkg.dev/google-samples/containers/gke/gradio-app:v1.0.4
        resources:
          requests:
            cpu: "512m"
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "512Mi"
        env:
        - name: CONTEXT_PATH
          value: "/generate"
        - name: HOST
          value: "http://llm-service"
        - name: LLM_ENGINE
          value: "tgi"
        - name: MODEL_ID
          value: "falcon-40b-instruct"
        - name: USER_PROMPT
          value: "User: prompt"
        - name: SYSTEM_PROMPT
          value: "Assistant: prompt"
        ports:
        - containerPort: 7860
---
apiVersion: v1
kind: Service
metadata:
  name: gradio-service
spec:
  type: LoadBalancer
  selector:
    app: gradio
  ports:
  - port: 80
    targetPort: 7860
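Save the manifest and apply it (the file name here is just an example), then wait for the LoadBalancer to receive an external IP:
kubectl apply -f gradio.yaml
kubectl get service gradio-service --watch
Once the EXTERNAL-IP column is populated, open http://EXTERNAL-IP in a browser; the Service forwards port 80 to Gradio on port 7860.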
We mentioned that Gradio can talk to different LLMs. How does all this magic happen?
This Gradio implementation makes REST-based HTTP POST requests to the LLM endpoint (above), which is what allows it to support Llama, Falcon and Mistral.
A simple example would look something like this here.
You can have a look at the code here.