Can't load fine-tuned Llama 2 7B

I fine-tuned a Llama 2 7B model and uploaded it to Hugging Face, but now when I load it in Google Colab I run out of system RAM. (fine-tuned model: Stoemb/llama-2-7b-html2text)

I loaded the model as follows:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "Stoemb/llama-2-7b-html2text"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
)
model.config.use_cache = False

I'm still learning myself, but I have been playing with a Llama 2 7B model in free Colab and I've found I need to set device_map to "auto" as well as loading in 4-bit or 8-bit, as in the sketch below.
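
For reference, the full loading snippet I use looks roughly like this (just a sketch, reusing the config from the question; it assumes accelerate and bitsandbytes are installed in the Colab runtime):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "Stoemb/llama-2-7b-html2text"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# device_map="auto" lets accelerate place the quantized weights on the GPU
# (and offload to CPU if needed) instead of materializing the whole model
# in system RAM first.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)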

Don't know if this is still an issue for you, but I had the same problem the first time I fine-tuned Llama 2 and tried to reload it.
The problem is that the default shard size when pushing to the Hub is 10 GB, which is too much for the free Colab T4 runtime to load.

You can read more here:

To solve this, specify a smaller shard size when pushing, as in the example below.

!huggingface-cli login
model.push_to_hub(your_model_name, max_shard_size='2GB')  # split the checkpoint into 2 GB shards
tokenizer.push_to_hub(your_model_name)

This solved the problem I had, at least, and it sounds very similar to yours.
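
If the checkpoint is already on the Hub with 10 GB shards, one way to re-shard it is to reload it once somewhere with enough CPU RAM and push it back with a smaller max_shard_size. A rough sketch (the repo name is the one from the question; the reload step itself needs a machine that can hold the full model):

from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "Stoemb/llama-2-7b-html2text"

# Reload the full checkpoint once (this is the step that needs plenty of RAM),
# then push it back in 2 GB shards so a free Colab runtime can load it afterwards.
model = AutoModelForCausalLM.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

model.push_to_hub(repo_id, max_shard_size="2GB")
tokenizer.push_to_hub(repo_id)

As far as I know, pushing again doesn't automatically remove the old shard files, so it's worth checking the repo afterwards.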