Out-of-memory error when creating a lot of embeddings

I am trying to create embeddings from a large amount of text, processing it paragraph by paragraph in a loop, but I always run out of GPU memory. Any idea why this is happening?

The code looks like this:

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xxl")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xxl", device_map="auto", torch_dtype=torch.float16)

def getSentenceEmbedding(sentenceText, languageModel, modelTokenizer):
    sentence_tokens = modelTokenizer(sentenceText, return_tensors="pt")
    sentence_input_ids = sentence_tokens.input_ids  #.to('cuda')
    encodings = languageModel.encoder(input_ids=sentence_input_ids, attention_mask=sentence_tokens.attention_mask, return_dict=True)
    del sentence_input_ids
    del sentence_tokens
    return torch.mean(encodings.last_hidden_state, dim=1)

for paragraph in text:
    ...  # loop body was cut off in the original post

Hey @MultiModal

Wild, unsubstantiated guess, assuming a single forward pass runs fine in your environment: have you tried not wrapping it in a function? There could be a problem with passing the model into the function and calling it repeatedly.

If that doesn't solve it, we would need a stand-alone script that reproduces the problem so we can try it on our end 🙂

Thanks for your response. I could try that, but the problem seems to have been that the embeddings array was holding references to tensors in VRAM. The first hint: when I saved each embedding to its own file instead of appending it to an array, the OOM errors stopped. Then it appeared that moving each embedding tensor `.to("cpu")` before putting it in the array also did the trick. But don't take my word for it; try it out and confirm. Saving the tensors to files suited my purposes, so that's what I went with. I might be misremembering the `.to("cpu")` part. Let me know.
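For anyone hitting the same thing, here is a minimal sketch of the pattern described above: detach each result and move it to the CPU before storing it, so the Python list holds no references to GPU memory. The `encode` callable and `collect_embeddings` name are made up for illustration (a stand-in for the encoder + mean-pool call in the thread); the `torch.no_grad()` wrapper is an extra standard precaution, not something mentioned in the thread, which also prevents the autograd graph from being kept alive across iterations.

```python
import torch

def collect_embeddings(batches, encode):
    """Accumulate per-batch embeddings without pinning GPU memory.

    `encode` is any callable that returns a tensor (here a hypothetical
    stand-in for the T5 encoder + mean-pooling call from the thread).
    """
    embeddings = []
    for batch in batches:
        with torch.no_grad():              # no autograd graph to retain
            emb = encode(batch)
        # detach() drops any remaining graph reference; .to("cpu") copies
        # the data off the device, so appending to the list keeps no VRAM
        embeddings.append(emb.detach().to("cpu"))
    return embeddings
```

The same idea applies to the save-to-file approach: call `emb.detach().to("cpu")` before `torch.save(...)` so nothing in the saved object points back at device memory.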

Thanks for helping. I really do appreciate it.