From my understanding, one of the uses of `deepspeed` is to partition a model across multiple GPUs, i.e. to avoid loading the whole model into a single machine's GPU RAM + CPU RAM by distributing it instead.
Let's take the case where a single machine cannot hold the model (even combining GPU and CPU RAM), but 2 or more machines together can:
Since `deepspeed` is integrated into the pipeline only after the model is loaded, i.e. after running `AutoModelForCausalLM.from_pretrained`, which already raises an OOM error, doesn't that defeat the purpose of using `deepspeed` in this case?
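To make the flow I mean concrete, here is a minimal sketch of the usual setup (the model name and ZeRO config are just placeholders, not a recommendation):

```python
import deepspeed
from transformers import AutoModelForCausalLM

# Step 1: the full model is materialized here, in a single machine's
# CPU/GPU memory -- this is where the OOM happens for a large model.
model = AutoModelForCausalLM.from_pretrained("some-large-model")  # hypothetical name

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},  # ZeRO-3 partitions parameters across GPUs
}

# Step 2: only now does DeepSpeed get a chance to partition the weights,
# which seems too late if step 1 already ran out of memory.
engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)
```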
Is there an alternative way to load the model in such cases (other than lowering the precision)?