Hello, I’m looking for ways to handle CUDA out-of-memory (OOM) errors when running inference on 70-billion-parameter models without resorting to quantization. Specifically, I’m interested in CPU/disk offloading. Does the Accelerate library offer a solution for this?
I’ve examined the current documentation, which appears to focus on custom NLP models rather than on loading pretrained large models such as LLaMA 2 70B. Here’s the guide I found: Hugging Face’s Accelerate Big Model Inference Guide.
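For concreteness, this is roughly what I’ve been trying based on my reading of that guide: a minimal sketch that relies on `transformers`’ `from_pretrained` with `device_map="auto"`, `max_memory`, and an `offload_folder` (which, as I understand it, uses Accelerate under the hood). The memory budgets, offload path, and hardware assumptions are placeholders for my setup, and I’m not sure this is the intended approach:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-hf"  # gated repo; assumes access has been granted

tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" hands layer placement to Accelerate; whatever does not fit
# within the GPU/CPU budgets below should spill to disk in `offload_folder`.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    max_memory={0: "22GiB", "cpu": "100GiB"},  # placeholder budgets for a single 24GB GPU
    offload_folder="offload",                  # placeholder path for disk-offloaded weights
    offload_state_dict=True,
    torch_dtype=torch.float16,
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

This loads without OOM for me on smaller models, but I’d like to confirm whether this pattern is the recommended way to do CPU/disk offloading for a 70B model, or whether there is a better-supported path in Accelerate itself.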
Additionally, I’ve looked into FlexGen, but the project no longer appears to be actively maintained: FlexGen GitHub Repository.