meta-llama/Llama-3.2-3B-Instruct supports 8-bit quantization
To run this on Google Colab, make sure you are on a GPU runtime and have run the following command:
!pip install -U bitsandbytes
This upgrades the existing bitsandbytes library to a CUDA-enabled version.
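Before loading the model, it can be worth confirming that the session actually exposes a GPU and that bitsandbytes is installed. This is a minimal sketch; it only assumes torch is available, which it is by default on Colab.

import importlib.metadata
import torch

# The 8-bit path needs a CUDA GPU; on Colab, select a GPU runtime first.
print("CUDA available:", torch.cuda.is_available())
print("bitsandbytes version:", importlib.metadata.version("bitsandbytes"))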
Then you can run inference with the following code:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Ask bitsandbytes to load the weights in 8-bit precision.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

model_name = "meta-llama/Llama-3.2-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize a prompt, move it to the GPU, and generate a short completion.
text = "hello world"
model_inputs = tokenizer([text], return_tensors="pt").to("cuda")
generated_ids = model.generate(**model_inputs, max_length=30)
print(tokenizer.batch_decode(generated_ids)[0])
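As a follow-up, you can query the model through its chat template (the usual way to prompt an Instruct model) and check how much memory the 8-bit weights occupy. This is a sketch reusing the model and tokenizer loaded above; the prompt text is arbitrary, and apply_chat_template and get_memory_footprint are standard transformers methods.

# Format the prompt with the model's chat template before generating.
messages = [{"role": "user", "content": "hello world"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
output_ids = model.generate(input_ids, max_new_tokens=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

# Rough check of the memory used by the quantized weights, in GiB.
print(f"Model footprint: {model.get_memory_footprint() / 1024**3:.2f} GiB")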
The takeaway is that meta-llama/Llama-3.2-3B-Instruct supports 8-bit quantization. More details can be found here.