I have a possibly silly question. I’m doing some experiments loading flan-t5-base from local storage. I’m trying to understand the tradeoffs of using 8-bit quantization and of using the CPU vs. the GPU for inference. I’m seeing some (to me) weird behavior, but I’m sure this is down to some default values in the from_pretrained function. As a baseline, my system uses 800 MB of RAM and 200 MB of GPU memory before loading anything. If I don’t set load_in_8bit at all, I see 2.4 GB of RAM usage and 200 MB of GPU memory usage. I’m loading the model from disk like this:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
If I set load_in_8bit=True, it only works with the GPU, and I see 3.76 GB of RAM and 1.8 GB of GPU memory in use. Setting the flag explicitly to False results in 3.76 GB of RAM and 2.3 GB of GPU memory (that makes sense). It’s not clear to me why I only see 2.4 GB of RAM usage if I don’t set anything. Using load_in_8bit results in greater total (CPU + GPU) memory usage (3.76 + 1.8 = 5.56 GB) than leaving the flag unset (2.4 + 0.2 = 2.6 GB). Also, at least for this model, inference feels as fast on the CPU as on the GPU. Is there any reason to use the GPU here? Thanks
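P.S. In case it helps, here is a small sketch of how I'm tallying the totals. It only does arithmetic on the numbers reported above and doesn't measure anything itself. The function name, the dictionary labels, and the "above baseline" figure are just my own bookkeeping, not anything from the transformers library.

```python
# Baseline usage before loading the model, in GB (from the numbers above).
BASELINE_RAM_GB = 0.8
BASELINE_GPU_GB = 0.2


def tally(ram_gb, gpu_gb):
    """Return (total, above_baseline) in GB for one observed configuration."""
    total = round(ram_gb + gpu_gb, 2)
    above_baseline = round(
        (ram_gb - BASELINE_RAM_GB) + (gpu_gb - BASELINE_GPU_GB), 2
    )
    return total, above_baseline


# Observed (RAM, GPU) usage in GB for each way of calling from_pretrained.
# The labels are hypothetical names for the three cases described above.
observations = {
    "flag not set": (2.4, 0.2),
    "load_in_8bit=True": (3.76, 1.8),
    "load_in_8bit=False": (3.76, 2.3),
}

for label, (ram_gb, gpu_gb) in observations.items():
    total, above = tally(ram_gb, gpu_gb)
    print(f"{label}: total {total} GB, above baseline {above} GB")
```

By this tally, the 8-bit run uses less in total than the explicit False run, but both are well above the run where I don't set the flag at all, which is the part I don't understand.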