Yes, I'm seeing the same problem. I've been experimenting with 13B-parameter LLaMA models in fp16, and when the context window is filled to the maximum, total memory use can be almost double what the model weights alone occupy. Yet everywhere I read that the overhead should be at most 20 percent of the model size.
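I think the gap comes from the KV cache: it grows linearly with both context length and batch size, so at long contexts the overhead can easily blow past the commonly quoted 20%. Here's a rough back-of-the-envelope sketch (my own, not from any docs); the geometry below assumes LLaMA-13B (40 layers, 40 attention heads, head dim 128, no grouped-query attention), and `seq_len`/`batch` are placeholders you'd adjust to your setup:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, dtype_bytes: int = 2) -> int:
    """Bytes needed to cache K and V for every layer at full context."""
    # Factor of 2 for the separate K and V tensors; dtype_bytes=2 for fp16.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Assumed LLaMA-13B geometry: 40 layers, 40 KV heads, head_dim 128 (hidden 5120).
weights_gib = 13e9 * 2 / 2**30  # fp16 weights: ~24 GiB
cache_gib = kv_cache_bytes(40, 40, 128, seq_len=4096, batch=1) / 2**30

print(f"weights ~{weights_gib:.1f} GiB, KV cache ~{cache_gib:.1f} GiB "
      f"(~{100 * cache_gib / weights_gib:.0f}% overhead)")
```

With these numbers a single 4096-token sequence adds about 3 GiB, roughly 13% of the fp16 weights, but the cache scales proportionally with batch size and context length, and on top of it come activation buffers and any preallocation your framework does. That could plausibly account for the near-2x you're observing, rather than contradicting the 20% figure outright.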