Using MarianMTModel in PyTorch is too slow for back-translation (not parallelised correctly)

Hi.
I’m trying to use MarianMTModel for back-translation as data augmentation. However, it’s too slow even with multiple GPUs, and I cannot use a batch size larger than 16 with the max length set to 300. It takes a full day to complete half an epoch.
The following is the code I’m using:

import numpy as np
import torch
import torch.nn as nn
from transformers import MarianMTModel, MarianTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# One language code per element, so np.random.choice picks a single code
target_langs = ['fr', 'wa', 'frp', 'oc', 'ca', 'rm', 'lld', 'fur', 'lij', 'lmo',
                'es', 'pt', 'gl', 'lad', 'an', 'mwl', 'it', 'co', 'nap', 'scn',
                'vec', 'sc', 'ro', 'la']

def translate(texts, model, tokenizer, language="fr"):
    with torch.no_grad():
        # Prepend the target-language token expected by multilingual Marian models
        template = lambda text: f"{text}" if language == "en" else f">>{language}<< {text}"
        src_texts = [template(text) for text in texts]
        encoded = tokenizer.prepare_seq2seq_batch(src_texts,
                                                  truncation=True,
                                                  max_length=300,
                                                  return_tensors="pt").to(device)
        # .module bypasses the DataParallel wrapper, so generate() runs on one GPU only
        translated = model.module.generate(**encoded)
        translated_texts = tokenizer.batch_decode(translated, skip_special_tokens=True)
        return translated_texts


def back_translate(texts, source_lang="en", target_lang="fr"):
    # Translate from source to target language
    fr_texts = translate(texts, target_model, target_tokenizer, 
                         language=target_lang)

    # Translate from target language back to source language
    back_translated_texts = translate(fr_texts, en_model, en_tokenizer, 
                                      language=source_lang)
    
    return back_translated_texts



target_model_name = 'Helsinki-NLP/opus-mt-en-de'
target_tokenizer = MarianTokenizer.from_pretrained(target_model_name)
target_model = MarianMTModel.from_pretrained(target_model_name)

en_model_name = 'Helsinki-NLP/opus-mt-de-en'
en_tokenizer = MarianTokenizer.from_pretrained(en_model_name)
en_model = MarianMTModel.from_pretrained(en_model_name)

target_model = nn.DataParallel(target_model)
target_model = target_model.to(device)  # same performance if I add .half()
target_model.eval()

en_model = nn.DataParallel(en_model)
en_model = en_model.to(device)  # same performance if I add .half()
en_model.eval()

## x1 and x2 are batches of strings. 
bk_x1 = back_translate(x1, source_lang="en", target_lang=np.random.choice(target_langs))   
bk_x2 = back_translate(x2, source_lang="en", target_lang=np.random.choice(target_langs))

Here is the GPU utilisation: it is low because of the small batch size of 16, but if I increase the batch size I get a CUDA out-of-memory error. Also, I can see that only one GPU is actually used for processing, so it may be that the Marian model cannot be parallelised correctly. If so, what would be the solution?

+-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:1B:00.0 Off |                  N/A |
| 42%   78C    P2   199W / 250W |   9777MiB / 11178MiB |     91%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:1C:00.0 Off |                  N/A |
| 29%   36C    P8    10W / 250W |      2MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:1D:00.0 Off |                  N/A |
| 31%   36C    P8     9W / 250W |      2MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:1E:00.0 Off |                  N/A |
| 35%   41C    P8     9W / 250W |      2MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 108...  Off  | 00000000:3D:00.0 Off |                  N/A |
| 29%   34C    P8     9W / 250W |      2MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX 108...  Off  | 00000000:3F:00.0 Off |                  N/A |
| 30%   31C    P8     8W / 250W |      2MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  GeForce GTX 108...  Off  | 00000000:40:00.0 Off |                  N/A |
| 31%   38C    P8     9W / 250W |      2MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  GeForce GTX 108...  Off  | 00000000:41:00.0 Off |                  N/A |
| 30%   37C    P8     9W / 250W |      2MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     58780      C   python                          10407MiB |
|    1   N/A  N/A     58780      C   python                              0MiB |
|    2   N/A  N/A     58780      C   python                              0MiB |
|    3   N/A  N/A     58780      C   python                              0MiB |
|    4   N/A  N/A     58780      C   python                              0MiB |
|    5   N/A  N/A     58780      C   python                              0MiB |
|    6   N/A  N/A     58780      C   python                              0MiB |
|    7   N/A  N/A     58780      C   python                              0MiB |
+-----------------------------------------------------------------------------+

FYI: I’m using
pytorch 1.7.0
transformers 4.0.1
cuda 10.1

The problem might be related to the tokenizer rather than the model. The MarianTokenizer does not have a Rust (fast) implementation, which may cause a bottleneck no matter how many GPUs you use. It might be a good idea to pre-tokenize your dataset once and use the datasets library for superfast on-the-fly access to that cached dataset.
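For example, here is a rough sketch of that idea, assuming the datasets library is installed; the list texts and the model name are placeholders, not taken from your code:

from datasets import Dataset
from transformers import MarianTokenizer

# Sketch only: `texts` is assumed to be a plain Python list of sentences.
tokenizer = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-de')

# Tokenize once, up front, instead of calling the slow tokenizer inside the loop
ds = Dataset.from_dict({"text": texts})
ds = ds.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=300),
    batched=True,
)
ds.set_format(type="torch", columns=["input_ids", "attention_mask"])

Once the map step has run, generation only has to call model.generate on the pre-built tensors instead of re-tokenizing every batch.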


Thanks, Bram, for your reply, but I already tried that and saw no performance increase: still only one GPU is utilised and it is just as slow. In addition, pre-tokenizing can only be done for the forward translation, not for the backward translation, since the intermediate texts are only produced at generation time.

cc @patrickvonplaten

This seems to be an issue with your parallelisation code. There is a distributed evaluation script for seq2seq models here; you could try to modify it for back-translation.


Thanks for your reply. I checked the examples; basically, they are use cases for distributed training (on different machines), but my problem is that the batch is not parallelised over multiple GPUs.

The run_distributed_eval.py script does distributed evaluation; it lets you run generation on multiple GPUs. You can find the script here.

I already saw the example, but it uses distributed data parallel, which is for distributed training on different machines. In my case I have multiple GPUs on a single machine, for which we usually use PyTorch's nn.DataParallel(), as I did in my code. The problem is that the pre-trained Marian model is not parallelised over the 8 GPUs, as the nvidia-smi output shows.

That’s not correct. On a single machine with multiple GPUs you often want to opt for DDP too, as it performs much better than DP. You can check the official PyTorch documentation on this if you’re interested, but whenever you can, I advise using DDP for true parallelism (true multiprocessing rather than single-process multithreading with GIL slowdown).

Also see: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
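
For inference-only back-translation you don’t even need gradient synchronisation, so the simplest form of that advice is one process per GPU, each translating its own shard of the data. Here is a rough sketch, not taken from this thread; the worker function, shard logic, and model name are all illustrative:

import torch
import torch.multiprocessing as mp
from transformers import MarianMTModel, MarianTokenizer

MODEL_NAME = 'Helsinki-NLP/opus-mt-en-de'   # illustrative model choice

def worker(rank, world_size, texts, return_dict):
    # Each process owns one GPU and an independent copy of the model.
    device = torch.device(f"cuda:{rank}")
    tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME)
    model = MarianMTModel.from_pretrained(MODEL_NAME).to(device).eval()

    shard = texts[rank::world_size]          # round-robin shard of the inputs
    outputs = []
    with torch.no_grad():
        for i in range(0, len(shard), 16):   # batch size 16, as in the original post
            batch = tokenizer(shard[i:i + 16], truncation=True, max_length=300,
                              padding=True, return_tensors="pt").to(device)
            generated = model.generate(**batch)
            outputs.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return_dict[rank] = outputs

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    manager = mp.Manager()
    return_dict = manager.dict()
    texts = ["Back-translation keeps the meaning but changes the wording."] * 64
    mp.spawn(worker, args=(world_size, texts, return_dict),
             nprocs=world_size, join=True)
    translations = [t for rank in range(world_size) for t in return_dict[rank]]

Each process keeps its own copy of the model, so per-GPU memory stays the same as in the single-GPU case while throughput scales with the number of GPUs.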