Question about memory usage

mlashcorp · May 15, 2023, 1:31pm

I have a possibly silly question. I'm experimenting with loading flan-t5-base from local storage, and I'm trying to understand the tradeoffs of 8-bit quantization and of running inference on the CPU vs. the GPU. I'm seeing some behavior that looks weird to me, though I'm sure it comes down to default values in the from_pretrained function. As a baseline, my system uses 800 MB of RAM and 200 MB of GPU memory. If I don't set load_in_8bit at all, I see 2.4 GB of RAM usage and 200 MB of GPU memory usage. I'm loading the model from disk like this:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
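
To compare configurations on the same footing, it can help to measure the process itself rather than system-wide usage. The following is a minimal sketch, assuming a Unix-like system (the standard-library `resource` module is not available on Windows). `peak_rss_mb` and `measure` are hypothetical helper names, and the lambda at the bottom is only a stand-in for the from_pretrained call.

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Approximate peak resident set size of this process, in MB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    return peak / 1e6 if sys.platform == "darwin" else peak / 1e3


def measure(load_fn):
    """Run load_fn and return (its result, growth in peak RSS in MB)."""
    before = peak_rss_mb()
    result = load_fn()
    return result, peak_rss_mb() - before


if __name__ == "__main__":
    # Stand-in workload (hypothetical); swap in something like
    # lambda: AutoModelForSeq2SeqLM.from_pretrained(model_path)
    _, grew = measure(lambda: bytes(50_000_000))
    print(f"peak RSS grew by about {grew:.0f} MB")
```

This only covers host RAM. GPU usage would need a separate reading, for example from nvidia-smi, which also counts the CUDA context and other processes.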

If I set load_in_8bit=True, it only works with the GPU, and I see 3.76 GB of RAM and 1.8 GB of GPU memory in use. Setting the flag explicitly to False results in 3.76 GB of RAM and 2.3 GB of GPU memory (that makes sense). It's not clear to me why I only see 2.4 GB of RAM usage if I don't set anything. Loading in 8-bit also results in greater total (CPU + GPU) memory usage (3.76 GB + 1.8 GB = 5.56 GB) than the 2.6 GB (2.4 GB + 200 MB) I see with the defaults. Also, at least for this model, inference feels as fast on the CPU as on the GPU. Is there any reason to use the GPU here? Thanks
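
For a rough point of comparison, the size of the weights alone can be estimated as parameter count times bytes per parameter. The sketch below assumes roughly 250 million parameters for flan-t5-base (an approximation, not something measured in this post) and ignores activations, the CUDA context, and framework overhead, so it is a lower bound rather than a prediction of the figures above.

```python
# Back-of-the-envelope weight footprint: parameters x bytes per parameter.
# ASSUMPTION: ~250M parameters for flan-t5-base (approximate, not measured here).
APPROX_PARAMS = 250_000_000

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_footprint_mb(num_params: int, dtype: str) -> float:
    """Return the approximate size of the weights alone, in MB (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype:>8}: {weight_footprint_mb(APPROX_PARAMS, dtype):,.0f} MB")
```

Under that assumption the weights come to about 1,000 MB in float32 and about 250 MB in int8, so the gap between these estimates and the measured numbers would have to come from overhead outside the weights themselves.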
