Hi all
To test and debug our PEFT script for very large models (405B), we want to avoid downloading the whole model on every rerun, in order to reduce experimentation time (and thus cost). Downloading the model once and storing it is unfortunately not an option.
To this end, we want the model loaded as if it had been properly initialized (e.g. via from_pretrained), i.e. with the correct dtype, distributed across the GPUs/nodes, and so on, just with random weights for now and without the download.
When using something like
mconfig = AutoConfig.from_pretrained(train_config.model_name)
model = LlamaForCausalLM(mconfig)
the model is initialized on the CPU, RAM runs out, and we get an OOMKilled error.
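We suspect something like accelerate's init_empty_weights could avoid materializing the weights on CPU, but it only gets us part of the way. A minimal sketch of what we mean (reusing the names from above):

from accelerate import init_empty_weights
from transformers import AutoConfig, LlamaForCausalLM

mconfig = AutoConfig.from_pretrained(train_config.model_name)

# Instantiate on the "meta" device: no real weight tensors are allocated,
# so CPU RAM is not exhausted. However, the weights still need to be
# materialized and dispatched across the GPUs/nodes afterwards, which is
# the part we don't know how to do cleanly with random weights.
with init_empty_weights():
    model = LlamaForCausalLM(mconfig)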
So, ideally, we would want to use something like
model = LlamaForCausalLM.from_pretrained(
    train_config.model_name,
    device_map="auto",
    torch_dtype=torch.float16,  # or torch.bfloat16
    random_weights=True,
)
since the from_pretrained function seems to handle all the complexities we need.
Is there any easy way to do this? We checked the parameters of from_pretrained, but did not find anything applicable.
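For context, the loaded model would then be wrapped for PEFT roughly as follows; this is only a sketch assuming LoRA, and the hyperparameters are placeholders rather than our actual settings:

from peft import LoraConfig, get_peft_model

# Placeholder LoRA config, just to illustrate how the model is consumed downstream.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()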
Thank you for any ideas,
Jonas