meta-llama/Llama-3.2-3B-Instruct supports 8-bit quantization

To run this on Google Colab, make sure the runtime is set to a GPU (Runtime > Change runtime type > T4 GPU), since bitsandbytes needs CUDA.
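
Before installing anything, you can quickly confirm that a GPU is visible to the runtime. A minimal check, assuming PyTorch is already preinstalled (as it is on Colab):

import torch

# bitsandbytes 8-bit loading needs a CUDA device, so this should report a GPU
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
else:
    print("No CUDA GPU visible - switch the Colab runtime to a GPU")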



Ensure you have run the following command:

!pip install -U bitsandbytes

This upgrades the preinstalled bitsandbytes library to the latest release, which ships with CUDA support.
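
To confirm the upgrade took effect, you can print the installed version (a quick optional sanity check; the exact version number will differ):

import bitsandbytes as bnb

# A recent bitsandbytes release is needed for 8-bit inference through transformers
print(bnb.__version__)

If an old version still shows up, restart the Colab runtime so the upgraded package is picked up.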

Then you can load the model in 8-bit and run inference with the following code (note that the meta-llama checkpoints are gated on the Hugging Face Hub, so you may need to log in with an access token first):


from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Ask bitsandbytes to load the weights in 8-bit
quant_config = BitsAndBytesConfig(load_in_8bit=True)

model_name = "meta-llama/Llama-3.2-3B-Instruct"

# Pass the quantization config so the model is quantized as it is loaded
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    quantization_config=quant_config
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "hello world"
# Tokenize the prompt and move the input tensors to the GPU
model_inputs = tokenizer([text], return_tensors="pt").to("cuda")

# max_length counts prompt tokens plus generated tokens
generated_ids = model.generate(**model_inputs, max_length=30)
tokenizer.batch_decode(generated_ids)[0]
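
To see what 8-bit loading buys you, you can also print the model's memory footprint. This is an optional check using the get_memory_footprint() helper that transformers models expose; the exact number will vary:

# Size of the quantized model in memory, in GB
print(f"{model.get_memory_footprint() / 1e9:.2f} GB")

With 8-bit weights, a 3B-parameter model should come out to roughly 3-4 GB, about half of its 16-bit footprint.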

The takeaway is that meta-llama/Llama-3.2-3B-Instruct supports 8-bit quantization. More details can be found here.

