Hi! Hugging Face blew my mind, it’s awesome, but I’m struggling to get better performance out of my 1080 Ti: GPU utilization is very low, around 3%, with the CPU at around 30%.
At first I got the “UserWarning: You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset” warning, so I switched to using a dataset. Now I get no warning, but nothing seems to have changed. I guess I’m doing something wrong… here’s some sample code:
import pandas as pd
from datasets import Dataset
from tqdm import tqdm
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset

to_translate = [...]  # several Arabic sentences
dataset = Dataset.from_pandas(pd.DataFrame(to_translate))

translator = pipeline(
    'translation_ar_to_en',
    model='Helsinki-NLP/opus-mt-ar-en',
    device=0,
)

def trans(ds):
    # (not used by the loop below, just an earlier attempt)
    ds['TRANSLATED'] = translator(ds['0'])
    return ds

# "0" is the column name Dataset.from_pandas assigns here
for out in tqdm(translator(KeyDataset(dataset, "0"))):
    print(out)
I think I didn’t get the dataset part right yet…
Thanks!
Hi Iván! I think you need to use the batch_size= argument when you call your pipeline() in order to process more text sequences at the same time. Something like this should work:
for out in tqdm(translator(KeyDataset(dataset, "0"), batch_size=32)):
    print(out)
You can experiment with different batch sizes to see which one gives you the best performance.
I hope this helps!
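One rough way to run that experiment (just a sketch, assuming the `translator` and `dataset` from the snippets above; the `throughput` helper is hypothetical, not a transformers API): time a fixed sample at each candidate batch size and keep the fastest.

```python
import time

def throughput(translate_fn, texts):
    """Return items processed per second for a single timed call."""
    start = time.perf_counter()
    translate_fn(texts)
    return len(texts) / (time.perf_counter() - start)

# Hypothetical usage with the pipeline above, on a 256-sentence sample:
# sample = dataset["0"][:256]
# for bs in (8, 16, 32, 64):
#     rate = throughput(lambda t: translator(t, batch_size=bs), sample)
#     print(f"batch_size={bs}: {rate:.1f} sentences/s")
```

Past some point larger batches stop helping (or run out of GPU memory), so the sweet spot depends on your card and sentence lengths.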
Hi everyone!
I do have a similar problem, working with a TranslationPipeline:
I have a Pandas DataFrame with a collection of German texts and their English translations. Now I want to back-translate the English column with t5-large (if there is a more recommended model, feel free to tell me).
This is the code-snippet:
import pandas as pd
from tqdm import tqdm
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

tqdm.pandas()  # enables .progress_apply on DataFrames/Series

tokenizer = AutoTokenizer.from_pretrained('t5-large', model_max_length=1024)
model = AutoModelForSeq2SeqLM.from_pretrained('t5-large')

translator = pipeline(
    'translation_en_to_de',
    model=model,
    tokenizer=tokenizer,
    device=0,
    batch_size=64,
)

# df is the DataFrame with columns ['de', 'en']
df['backtranslation'] = df['en'].progress_apply(translator)
Of course the back-translation of around 600,000 rows takes quite some time.
But I also get the UserWarning: You seem to be using the pipelines sequentially on GPU., and I can translate at about 3.5 it/s (on an RTX 2070 Super). So I think (and hope) I could do better here.
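The sequential warning makes sense here: `progress_apply` invokes the pipeline once per row, so `batch_size=64` never gets a chance to form GPU batches. A sketch of the batched alternative (the `backtranslate_column` helper is my own name, not a transformers API): pass the whole column to the pipeline in one call and let it batch internally.

```python
import pandas as pd

def backtranslate_column(df, translator, src_col="en", out_col="backtranslation"):
    """Translate a whole column in one pipeline call so batch_size applies.

    `translator` is any callable mapping a list of strings to a list of
    dicts with a "translation_text" key (the transformers translation
    pipeline returns exactly that shape).
    """
    outputs = translator(df[src_col].tolist())
    df[out_col] = [o["translation_text"] for o in outputs]
    return df

# Hypothetical usage with the pipeline above:
# df = backtranslate_column(df, translator)
```

Wrapping the call in tqdm would need a dataset-style iterator as in the earlier answer; the key point is one pipeline call over many texts instead of one call per row.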
I also tried putting the DataFrame in a dataset, but it got even slower with: