Hi,
I am experimenting with Hugging Face transformers using sentence-to-sentence models (Helsinki_en_to_de, Helsinki_de_to_en). Whether I call the tokenizer and then the model manually or use a pipeline, the memory used by Uvicorn grows as soon as I run a few concurrent inferences. The more concurrent requests there are, the more dramatic the growth, and the memory is not released. Uvicorn starts well below 1 GB and goes up to 1.1 GB after one model is loaded, but when I run inferences concurrently it can reach 6 GB (I stopped at that point).
I tried using smaller batches (max_length=128), but the memory usage still kept increasing, just more slowly.
I tried using tracemalloc and pympler to profile memory, but they do not pinpoint any cause (that is, no single collection of objects is growing that much). When using pympler, a warning was printed:
Relevant parts of my code:
tokenized = tokenizer(sentences, return_tensors="pt", padding=True)
translated_encoded = model.generate(**tokenized)
translated = [tokenizer.decode(t, skip_special_tokens=True) for t in translated_encoded]
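To illustrate the kind of tracemalloc check mentioned above, here is a minimal sketch of a snapshot comparison. It is not the exact profiling code: `run_inference_stub` and `top_growth` are hypothetical names, and the stub only allocates plain Python objects in place of the real tokenizer and `model.generate` calls. Note that tracemalloc only tracks memory allocated through Python's own allocators, so memory allocated natively by a C++ backend would not appear in a comparison like this.

```python
import tracemalloc


def top_growth(before, after, limit=5):
    """Return the `limit` largest per-line allocation differences between two snapshots."""
    # compare_to() sorts the differences by absolute size change, largest first.
    return after.compare_to(before, "lineno")[:limit]


def run_inference_stub():
    # Hypothetical stand-in for the tokenizer + model.generate calls;
    # it just allocates about 1 MB of Python objects.
    return [bytes(1024) for _ in range(1000)]


tracemalloc.start()
before = tracemalloc.take_snapshot()

kept = run_inference_stub()  # keep a reference so the memory stays allocated

after = tracemalloc.take_snapshot()
stats = top_growth(before, after)
tracemalloc.stop()

for stat in stats:
    print(stat)
```

If the lines reported here stay small while the process memory keeps climbing, the growth is probably happening outside the Python-level allocations that tracemalloc can see.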
I would be grateful for any suggestions.