meta-llama/Llama-3.2-3B-Instruct supports 8-bit quantization
To run this on Google Colab, make sure you are on a GPU runtime and have run the following command:
!pip install -U bitsandbytes
This upgrades the existing bitsandbytes library to a CUDA-enabled version.
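Before loading the model, it can be worth confirming that the session actually exposes a GPU and that bitsandbytes is installed. This is a minimal sketch; it only assumes torch is available, which it is by default on Colab.

import importlib.metadata
import torch

# The 8-bit path needs a CUDA GPU; on Colab, select a GPU runtime first.
print("CUDA available:", torch.cuda.is_available())
print("bitsandbytes version:", importlib.metadata.version("bitsandbytes"))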
Then you can run inference with the following code:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Ask bitsandbytes to load the weights in 8-bit precision.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

model_name = "meta-llama/Llama-3.2-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize a prompt, move it to the GPU, and generate a short completion.
text = "hello world"
model_inputs = tokenizer([text], return_tensors="pt").to("cuda")
generated_ids = model.generate(**model_inputs, max_length=30)
print(tokenizer.batch_decode(generated_ids)[0])
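As a follow-up, you can query the model through its chat template (the usual way to prompt an Instruct model) and check how much memory the 8-bit weights occupy. This is a sketch reusing the model and tokenizer loaded above; the prompt text is arbitrary, and apply_chat_template and get_memory_footprint are standard transformers methods.

# Format the prompt with the model's chat template before generating.
messages = [{"role": "user", "content": "hello world"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
output_ids = model.generate(input_ids, max_new_tokens=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

# Rough check of the memory used by the quantized weights, in GiB.
print(f"Model footprint: {model.get_memory_footprint() / 1024**3:.2f} GiB")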
The takeaway is that meta-llama/Llama-3.2-3B-Instruct supports 8-bit quantization. More details can be found here.