I am currently trying the new model stabilityai/stablecode-completion-alpha-3b on a free Colab notebook with a GPU (12 GB of system RAM and a 14 GB T4 GPU).
This is the code I am using:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablecode-completion-alpha-3b")
model = AutoModelForCausalLM.from_pretrained(
    "stabilityai/stablecode-completion-alpha-3b",
    trust_remote_code=True,
    load_in_8bit=True,
    device="cuda",
)
As soon as the model starts to load, the system RAM fills up until I get the warning that the RAM has been exhausted and the environment is restarted. I don't understand why the model is being loaded into system RAM instead of GPU RAM, or why all 12 GB of system RAM get used up for a quantized 3B model.
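For reference, my understanding from the transformers quantization docs is that 8-bit loading with bitsandbytes is meant to be combined with `device_map` (not a `device` argument) so that accelerate can place the quantized weights directly on the GPU instead of materializing them in system RAM first. A sketch of what I would try (assumes bitsandbytes and accelerate are installed; the actual load is commented out to avoid the multi-GB download here):

```python
# Sketch of a loading call intended to keep the weights off system RAM.
# Assumption: "device_map" (not "device") is the from_pretrained parameter
# that controls weight placement when load_in_8bit=True.
model_id = "stabilityai/stablecode-completion-alpha-3b"

load_kwargs = dict(
    trust_remote_code=True,
    load_in_8bit=True,    # 8-bit quantization via bitsandbytes
    device_map="auto",    # let accelerate dispatch shards onto the GPU
)

# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs)
```

I have not been able to confirm on this notebook whether this avoids the system-RAM spike, so corrections are welcome.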