`load_in_8bit` vs. loading an 8-bit quantized model

You could try `load_in_4bit`; it may take the edge off the VRAM usage.
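For reference, a minimal sketch of 4-bit loading with transformers + bitsandbytes (the model ID here is just a placeholder, swap in your own):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder: replace with your model

# NF4 quantization with fp16 compute; weights are quantized on the fly at load time
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available devices
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```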

If you are able to share a code snippet, we may be able to see if there are problems leading to unnecessary GPU memory usage.