If you're using Hugging Face and getting the error above, it means your runtime needs more CPU memory (RAM) than is available. You can try to work around it by setting `low_cpu_mem_usage=False`, as in the following example:

```python
import torch
from transformers import LlamaTokenizer, LlamaForCausalLM

tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=False,  # load weights the standard way, without Accelerate
)
```

Sometimes the cause of the issue is that you set `device_map='auto'`. This turns on the Accelerate library, which decides where to place the model's modules automatically; setting it to `0` instead places the whole model on GPU 0. For example, when I configure `device_map='auto'` and run on Google Colab's free runtime, it throws this exception:

```python
tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map='auto',        # enables Accelerate's automatic device placement
    low_cpu_mem_usage=False,
)
```
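If your runtime does have a GPU, here is a minimal sketch of the `device_map=0` variant mentioned above; the `model_path` value is a hypothetical placeholder, so substitute your own checkpoint:

```python
import torch
from transformers import LlamaTokenizer, LlamaForCausalLM

model_path = "path/to/llama-checkpoint"  # hypothetical path; replace with your model

tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map=0,  # pin the entire model to GPU 0 instead of letting Accelerate split it
)
```

Note that any `device_map` value still requires the `accelerate` package to be installed; the difference is only where the weights end up.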