Llama-2 on Colab


I am trying to download Llama-2 for text generation on the free version of Google Colab. I simply tried the following:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, token=True)
model = AutoModelForCausalLM.from_pretrained(model_name, token=True)

But this gives me a “ran out of RAM” error and the runtime crashes. I noticed that the GPU RAM wasn’t being used, while the CPU RAM went past the limit and crashed the runtime. I saw some potential solutions online that involve checkpointing; I haven’t done this before, but I will learn how if it is useful. Is there any way to get this model running on Colab? Additionally, as a more general question: how can I predict how much memory it takes to run a specific model?

Any advice is much appreciated. Thank you!

I think I am having the same issue; see “Colab RAM Limit Exceeded: Unable to Run 3B Model Even with Quantization”. I will share a solution if I find one.

By any chance, did you find something?

You can use Llama 2 in Colab with 4-bit quantization, which reduces memory usage, but it will not work without a GPU. Below is the link:

To use the model, this is the main code:

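import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# 4-bit loading needs a CUDA GPU and the bitsandbytes package installed
if torch.cuda.is_available():
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf",
                                                 torch_dtype="auto", load_in_4bit=True)

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
    tokenizer.use_default_system_prompt = False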
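
On the more general question of predicting memory, a rough rule of thumb is that the weights alone take about (number of parameters) × (bytes per parameter). The sketch below is only a back-of-the-envelope estimate: the helper name `estimate_weight_memory_gib` is made up for illustration, the 7e9 parameter count is an approximation for the 7B model, and the estimate ignores activations, the KV cache, and framework overhead, so real usage will be somewhat higher.

```python
def estimate_weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the model weights alone, in GiB.

    Hypothetical helper for illustration; it does not account for
    activations, the KV cache, or library overhead.
    """
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1024**3


# Assumption: roughly 7 billion parameters for Llama-2-7b
NUM_PARAMS = 7e9

for label, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("4-bit", 4)]:
    size = estimate_weight_memory_gib(NUM_PARAMS, bits)
    print(f"{label:>8}: ~{size:.1f} GiB")
```

Under these assumptions the weights come to roughly 26 GiB in float32, about 13 GiB in float16, and a little over 3 GiB at 4 bits. That would be consistent with the crash in the original post: without a `torch_dtype` or quantization argument, the model is likely loaded in full precision into CPU RAM, which is far more than the free Colab tier provides.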