llama.cpp: running it in Google Colab

There are two options to do this in Google Colab.

Option 1 (Easiest) 

With this option, we can just install the relevant package and then run the model. 

First, we need to install the required package using the following command:

!pip install -U llama-cpp-python
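
To quickly sanity-check the install, you can print the package version:

import llama_cpp
print(llama_cpp.__version__)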

Next, we will download and load the model using the following code:

from llama_cpp import Llama

# download the GGUF file from the Hugging Face Hub and load it
llm = Llama.from_pretrained(
    repo_id="unsloth/Qwen3-0.6B-GGUF",
    filename="Qwen3-0.6B-IQ4_NL.gguf",
)
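
As an aside, if you already have a GGUF file on disk, you can point the Llama constructor at it directly instead of pulling from the Hub (the path below is only an illustration):

# hypothetical local path to an already-downloaded GGUF file
llm = Llama(model_path="/content/Qwen3-0.6B-IQ4_NL.gguf")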

We can see the model being downloaded in the cell output.

Then we will run the following command to test the model:

llm.create_chat_completion(
  messages = [
    {
      "role": "user",
      "content": "What is the capital of France?"
    }
  ]
)

And the output will be a plain Python dict holding the model's answer.

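The dict follows the OpenAI chat-completion shape, so the answer text can be pulled out directly; a minimal sketch reusing the llm object from above:

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is the capital of France?"}]
)
# the generated text sits under choices[0] -> message -> content
print(response["choices"][0]["message"]["content"])
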
Option 2 (Build from source)

llama.cpp is a powerful inference engine, and if you want to get it running in Google Colab, you've come to the right place. Create your notebook, then build and install llama.cpp; the build can take up to 5 minutes.

In a cell, run the following:

!apt-get -qq install build-essential cmake
!git clone https://github.com/ggerganov/llama.cpp
%cd llama.cpp
!cmake -B build
!cmake --build build --config Release
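
The build is the slow part; CMake's standard parallel-build flag can speed it up a little (Colab VMs usually only have a couple of cores):

!cmake --build build --config Release -j $(nproc)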

Then you will get a bunch of binaries under build/bin, including llama-server; you can list them as shown below.

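For example (the path follows the clone location above):

!ls /content/llama.cpp/build/bin
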
If you run !/content/llama.cpp/build/bin/llama-server -h, you will get the full help output listing the server's options.

And yes, it has a lot of options; we're just scratching the surface of what llama.cpp can do here.

Next, run llama-server using the following command (the -hf flag tells it to fetch the model from the Hugging Face Hub):

!nohup /content/llama.cpp/build/bin/llama-server -hf unsloth/Qwen3-0.6B-GGUF:Q4_K_M --host 0.0.0.0 --port 8000 &

This will run the server in the background. Note that nohup returns immediately while the model is still downloading and loading, so wait until the server is ready before querying it.
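
A minimal way to wait is to poll llama-server's /health endpoint (the retry count and sleep interval here are arbitrary choices):

import time, urllib.request

# poll until the server has loaded the model and answers /health with 200
for _ in range(60):
    try:
        with urllib.request.urlopen("http://localhost:8000/health") as r:
            if r.status == 200:
                print("server is ready")
                break
    except Exception:
        time.sleep(2)

Once the server reports ready, we can issue a curl command and ask it a question: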

!curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "Qwen3-0.6B", "messages": [{"role": "user", "content": "Tell me a fun fact about space."}], "temperature": 0.7}'

And we get some decent output back as an OpenAI-style JSON response.

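Since llama-server exposes an OpenAI-compatible API, you can also query it from Python instead of curl; a minimal sketch using the openai client (assumes the server above is still running on port 8000):

!pip install -q openai

from openai import OpenAI

# point the client at the local llama-server; the API key is unused locally
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="Qwen3-0.6B",
    messages=[{"role": "user", "content": "Tell me a fun fact about space."}],
    temperature=0.7,
)
print(resp.choices[0].message.content)
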
Reference notebooks

Option 1 - https://github.com/mitzenjeremywoo/google-colab-notebooks/blob/main/llama_cpp_python_gwen_google_colab.ipynb

Option 2 - https://github.com/mitzenjeremywoo/google-colab-notebooks/blob/main/llama_cpp_running_gwen_3_600k.ipynb
