Big Model Inference: CPU/Disk Offloading for Transformers Using from_pretrained

Hello, I’m exploring ways to handle CUDA out-of-memory (OOM) errors when running inference with 70-billion-parameter models, without resorting to quantization. Specifically, I’d like to use CPU/disk offloading. Does the Accelerate library offer a solution for this?

I’ve read the current documentation, but it seems to focus on custom NLP models rather than on loading pretrained large models such as Llama 2 70B. Here’s the guide I found: Hugging Face’s Accelerate Big Model Inference guide.

Additionally, I’ve looked into FlexGen, but it seems the project is no longer active: FlexGen GitHub Repository.

Yes. Just set device_map="auto" in your from_pretrained call and Accelerate will place weights on the GPU first, then spill to CPU RAM and (if you provide an offload folder) to disk automatically.
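For reference, a minimal sketch of what that looks like with Transformers + Accelerate; the checkpoint name, offload directory, and dtype below are just example choices, not requirements:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example checkpoint; swap in whichever 70B model you have access to.
model_id = "meta-llama/Llama-2-70b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # Accelerate fills GPU(s), then CPU, then disk
    torch_dtype=torch.float16,  # half-precision weights, no quantization
    offload_folder="offload",   # directory for weights that spill to disk
    offload_state_dict=True,    # reduces CPU RAM peak while loading
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

If you want finer control over how much each device takes before spilling to the next, you can also pass a max_memory dict to from_pretrained, e.g. max_memory={0: "20GiB", "cpu": "60GiB"}. Expect generation to be slow once layers land on CPU or disk, since weights are streamed in per forward pass.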

