Transformer for Translation from Scratch with Hugging Face/PyTorch


In recent weeks, I’ve been working with the transformers library to build a transformer model for translation from scratch. There have been similar topics, but I couldn’t find a suitable answer to my issues.

First, the large variety of models on the hub is amazing, but which one do you recommend for a simple transformer model for translation? So far, the T5 model and its tokenizer from the hub form the backbone of my project. (The T5 paper says that its architecture is close to the original from "Attention Is All You Need".)

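    # initialize the pretrained tokenizer and a randomly initialized model
    # (model_name is the name of a T5 checkpoint on the hub; only its config is reused)
    import torch as th
    import transformers
    device = th.device('cuda' if th.cuda.is_available() else 'cpu')
    tokenizer = transformers.T5Tokenizer.from_pretrained(model_name)
    config = transformers.AutoConfig.from_pretrained(model_name)
    model = transformers.T5ForConditionalGeneration(config).to(device)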

Secondly, I cannot use the Trainer of the transformers library for research-related reasons. Therefore, I need to write my own training routine in PyTorch. At first sight this seems to be a simple problem, but somehow my model does not significantly improve its BLEU score, even though I trained it for many epochs on a common dataset like WMT16.

The training routine is as follows:

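    import torch as th
    from tqdm import tqdm

    def train_epoch(model, train_dataloader, optimizer, lr_scheduler, CLIP):
        """
        Trains model on the entire dataset for one epoch.
        model (nn.Module): Torch model
        train_dataloader (torch.utils.data.DataLoader): Dataloader
        optimizer (th.optim.Optimizer): Optimizer
        lr_scheduler: Learning rate scheduler (assumed here to be stepped once per batch)
        CLIP (float): Maximum gradient norm for gradient clipping
        returns average epoch loss
        """
        model.train()
        epoch_loss = 0
        for batch in tqdm(train_dataloader):
            src_ids = batch['src_ids'].to(device)
            trg_ids = batch['trg_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            optimizer.zero_grad()  # clear gradients left over from the previous batch
            loss = model(input_ids=src_ids, attention_mask=attention_mask, labels=trg_ids).loss
            loss.backward()  # gradients must exist before they can be clipped
            th.nn.utils.clip_grad_norm_(model.parameters(), CLIP)
            optimizer.step()  # update the weights
            lr_scheduler.step()  # advance the learning rate schedule
            epoch_loss += loss.item()
        return epoch_loss / len(train_dataloader)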
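
One detail worth checking in a hand-written loop like the one above: when `labels` is passed to `T5ForConditionalGeneration`, positions set to -100 are ignored by the loss, whereas padded positions that still hold the pad token id are treated as regular targets. Below is a minimal sketch of a padding step for a custom collate function. The helper name `pad_labels` and the token ids are hypothetical and are not taken from the repository.

```python
from typing import List

IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss in transformers


def pad_labels(sequences: List[List[int]], ignore_index: int = IGNORE_INDEX) -> List[List[int]]:
    """Right-pad target token id sequences to equal length with ignore_index.

    Hypothetical helper for illustration; the result can be wrapped with
    torch.tensor(...) and passed as `labels` to the model.
    """
    max_len = max((len(seq) for seq in sequences), default=0)
    return [seq + [ignore_index] * (max_len - len(seq)) for seq in sequences]


if __name__ == "__main__":
    # Made-up token ids; 1 stands in for the end-of-sequence token.
    batch = [[37, 1782, 1], [363, 1]]
    print(pad_labels(batch))  # [[37, 1782, 1], [363, 1, -100]]
```

If the dataloader already pads the targets with the tokenizer's pad id, the same effect can be obtained by replacing that id with -100 in `trg_ids` before the forward pass.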

When I evaluate the above model with the SacreBLEU score using datasets.load_metric('sacrebleu'), I obtain values between 0.1 and 0.5 after 30-50 epochs on WMT16. This clearly indicates that the model has not learned to translate despite the training.
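
When scores are this low, it may also be worth ruling out the way the metric inputs are prepared. The sacrebleu metric takes predictions as a list of strings and references as a list of lists of strings (one inner list per prediction), and it reports BLEU on a 0 to 100 scale. The following is a small sketch of that preparation step; the helper name `postprocess_text` and the example sentences are assumptions for illustration only.

```python
from typing import List, Tuple


def postprocess_text(preds: List[str], refs: List[str]) -> Tuple[List[str], List[List[str]]]:
    """Strip surrounding whitespace and wrap each reference in its own list."""
    clean_preds = [p.strip() for p in preds]
    clean_refs = [[r.strip()] for r in refs]
    return clean_preds, clean_refs


if __name__ == "__main__":
    preds, refs = postprocess_text([" Das ist ein Test. "], ["Das ist ein Test.\n"])
    print(preds, refs)
    # With the metric from the post, the call would then look like this (not run here):
    # metric = datasets.load_metric('sacrebleu')
    # score = metric.compute(predictions=preds, references=refs)['score']
```

The predictions passed in are assumed to be decoded strings, for example from `tokenizer.batch_decode(..., skip_special_tokens=True)`, not raw token ids.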

I would appreciate it if you could help me solve this problem! The source code can be viewed on GitHub: b-turan/transformer_pytorch, an implementation of a Transformer for neural machine translation in PyTorch and Hugging Face (work in progress).

Same question! I want to reproduce the results of but have no idea how to train a translation model from scratch.

Hey victordiao,

I managed to train the T5 model from scratch on WMT-16 (English-German) with the help of the Hugging Face tutorial.

If you like, check out the file in my repository, which is implemented in the style of the tutorial.

It’s short and condensed and should lead to some understanding of the underlying processes.

Hi b-turan,
Thanks very much!
May I ask about the performance? Is it as good as expected?

I was just interested in running a few experiments and did not focus on any hyperparameter tuning. Surely, one can achieve better results, but this just serves as a starting point, so I can’t say anything relevant about performance.

I see. Thanks for your reply!