As I read here, the dataset is split into num_proc parts and each part is processed separately:
> When num_proc > 1, map splits the dataset into num_proc shards, each of which is mapped to one of the num_proc workers. So in your case, this means that some workers finished processing their shards earlier than others.
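To make the sharding idea concrete, here is a minimal sketch of what that splitting roughly corresponds to, done by hand with `Dataset.shard`. This is my own illustration of the concept, not the library's internal implementation; the toy dataset and the loop are placeholders:

```python
# Rough sketch: num_proc=4 behaves approximately like splitting the dataset
# into 4 shards and letting a separate worker process map each shard.
from datasets import Dataset

dataset = Dataset.from_list([{'text': f'sentence {i}'} for i in range(100)])

num_proc = 4
shards = [dataset.shard(num_shards=num_proc, index=i) for i in range(num_proc)]

# Shard sizes can differ slightly, so some workers finish before others.
for i, shard in enumerate(shards):
    print(f'shard {i}: {shard.num_rows} rows')
```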
Here is my code:
```python
import torch
from datasets import Dataset
from transformers import AutoModel, AutoTokenizer


def _get_embeddings(texts):
    # Tokenize a batch of texts and run the model without gradient tracking.
    encoded_input = tokenizer(
        texts, padding=True, truncation=True, return_tensors='pt'
    )
    with torch.no_grad():
        encoded_input = {
            k: v.to('cpu') for k, v in encoded_input.items()
        }
        model_output = model(**encoded_input)
    return model_output.pooler_output.tolist()


checkpoint = 'sentence-transformers/paraphrase-multilingual-mpnet-base-v2'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

texts = [
    {'text': 'first sentence'},
    {'text': 'second sentence'},
    # ...
    {'text': 'billionth sentence'},
]

dataset = Dataset.from_list(texts)
predictions = dataset.map(
    lambda x: {
        'embeddings': _get_embeddings(x['text'])
    },
    batched=True,
    batch_size=32,
    num_proc=4
)
```
Am I right that the code above will consume 4x more memory?
How exactly is this code supposed to work? Will the model somehow be copied into 4 Python processes, so that it consumes four times more RAM and I also pay an overhead in time? Or do all 4 Python processes have access to a "shared" model?
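For what it's worth, here is a small diagnostic sketch I would use to check this empirically. It is my own addition, not part of the pipeline above: it only reports the resident memory of each `map()` worker via psutil, and the toy dataset and `check_memory` helper are placeholders:

```python
# Rough diagnostic: print the pid and resident set size from inside each
# map() worker, to see how much memory each process is actually using.
import os

import psutil
from datasets import Dataset


def check_memory(batch):
    rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2
    print(f'pid={os.getpid()} rss={rss_mb:.0f} MiB')
    return batch


dataset = Dataset.from_list([{'text': f'sentence {i}'} for i in range(1000)])
dataset.map(check_memory, batched=True, batch_size=250, num_proc=4)
```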