CUDA OOM on model(inputs) but not on model.generate(inputs), but doesn't generate use model(inputs)?

Using the same model inputs:

kwargs = {
    "input_ids": input_ids,
    "attention_mask": attention_mask,
}

I get a CUDA out of memory error when I do

outputs = model(**kwargs)

but not when I do

output_ids = model.generate(
    **kwargs,
    do_sample=False,
    max_new_tokens=max_output_len,
    pad_token_id=tokenizer.eos_token_id,
    top_p=None,
)

Doesn't model.generate do a model(**kwargs)-like operation several times internally? Why is my version so memory inefficient?

The model is microsoft/phi-1_5 and the transformers version is 4.40.1.

Hi!

The reason is that a plain model.forward() call tracks gradients, keeping all intermediate activations around for a potential backward pass, unless you wrap it in:

with torch.no_grad():
    model.forward(**inputs)

The generate method already runs under the no-grad decorator internally, so it does not use much memory :hugs:
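
For context, a quick way to see the difference is to compare peak CUDA memory for the two kinds of forward call. This is only a sketch: it assumes model, input_ids and attention_mask from above are already on the GPU, and it uses the standard torch.cuda memory utilities.

import torch

def peak_forward_memory(use_no_grad: bool) -> int:
    # Reset the peak-memory counter before running one forward pass.
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    if use_no_grad:
        with torch.no_grad():
            model(input_ids=input_ids, attention_mask=attention_mask)
    else:
        # Gradient tracking keeps every intermediate activation alive
        # for a potential backward pass, so the peak is much higher.
        model(input_ids=input_ids, attention_mask=attention_mask)
    return torch.cuda.max_memory_allocated()

print("plain forward:        ", peak_forward_memory(False))
print("with torch.no_grad(): ", peak_forward_memory(True))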


Ah, of course! Thank you raushan! The community delivers.

As a follow-up though, would model.eval() not do the same thing as:

with torch.no_grad():
    outputs = model.forward(**inputs)

print(outputs)

No, model.eval() only switches layers that behave differently at inference time, e.g. batchnorm or dropout layers, into evaluation mode.
torch.no_grad() is what deactivates gradient tracking and thereby saves the memory.
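
A small sketch of the distinction, again assuming the model and inputs from earlier in the thread: under model.eval() the output still tracks gradients, while under torch.no_grad() it does not.

import torch

model.eval()  # switches dropout/batchnorm to eval mode; gradients are still tracked
out_eval = model(input_ids=input_ids, attention_mask=attention_mask)
print(out_eval.logits.requires_grad)  # True: activations are kept for a backward pass

with torch.no_grad():  # disables gradient tracking entirely
    out_nograd = model(input_ids=input_ids, attention_mask=attention_mask)
print(out_nograd.logits.requires_grad)  # False: this is what saves the memory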
