Transformer for Translation from Scratch with Hugging Face/PyTorch

Greetings!

In recent weeks, I’ve been working with the transformers library to build a transformer model for translation from scratch. There have been similar topics, but I couldn’t find a suitable answer to my issues.

First, the wide variety of models on the hub is amazing, but which one do you recommend for a simple transformer model for translation? So far, the T5 model and its tokenizer from the hub form the backbone of my project. (The T5 paper states that its architecture is close to the original one from "Attention Is All You Need".)

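    import transformers

    # Load the pretrained tokenizer, but build the model from its config only,
    # so the weights are randomly initialized (i.e. trained from scratch).
    # `model_name` and `device` are defined elsewhere in the script.
    tokenizer = transformers.T5Tokenizer.from_pretrained(model_name)
    config = transformers.AutoConfig.from_pretrained(model_name)
    model = transformers.T5ForConditionalGeneration(config).to(device)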

Second, I cannot use the Trainer of the transformers library because of research-related constraints, so I need to write my own training routine in PyTorch. At first sight this seems to be a simple problem, but my model does not significantly improve its BLEU score, even though I trained it for many epochs on a common dataset like WMT16.

The training routine is as follows:

import torch as th
from tqdm import tqdm

def train_epoch(model, train_dataloader, optimizer, lr_scheduler, CLIP):
    '''
    Trains model on the entire dataset for one epoch.
    ------------------------------------
    model (nn.Module): Torch model
    train_dataloader (torch.utils.data.DataLoader): Dataloader
    optimizer (th.optim.Optimizer): Optimizer
    lr_scheduler: Learning rate scheduler, stepped once per batch
    CLIP (float): Maximum gradient norm for clipping
    ------------------------------------
    returns average epoch loss
    '''
    model.train()
    epoch_loss = 0.0
    for batch in tqdm(train_dataloader):
        # `device` is a global defined elsewhere in the script
        src_ids = batch['src_ids'].to(device)
        trg_ids = batch['trg_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        # passing labels makes the model compute the cross-entropy loss itself
        loss = model(input_ids=src_ids, attention_mask=attention_mask, labels=trg_ids).loss
        loss.backward()
        th.nn.utils.clip_grad_norm_(model.parameters(), CLIP)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        epoch_loss += loss.item()
    return epoch_loss / len(train_dataloader)
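
One thing worth checking in a loop like this is how padding in `trg_ids` is handled. The loss computed by the Hugging Face seq2seq models ignores label positions set to -100, so padded target positions are usually replaced with that value before being passed as `labels`. The following is a minimal sketch under the assumption that the dataloader pads `trg_ids` with the tokenizer's pad id; the helper name `mask_pad_labels` and the toy ids are hypothetical.

```python
import torch

def mask_pad_labels(trg_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    """Return a copy of trg_ids with padding positions set to -100."""
    labels = trg_ids.clone()
    labels[labels == pad_token_id] = -100  # ignored by the cross-entropy loss
    return labels

# Toy batch: in T5 vocabularies 0 is the pad id and 1 the end-of-sequence id.
trg_ids = torch.tensor([[37, 1712, 1, 0, 0],
                        [4, 5, 6, 7, 1]])
labels = mask_pad_labels(trg_ids, pad_token_id=0)
print(labels)
```

In the training loop this would amount to passing `labels=mask_pad_labels(trg_ids, tokenizer.pad_token_id)` instead of `labels=trg_ids`. Whether this accounts for the low scores depends on how the batches are built, which the snippet above does not show.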

When I evaluate the above model with SacreBLEU via datasets.load_metric('sacrebleu'), I obtain scores between 0.1 and 0.5 after 30-50 epochs on WMT16. Scores this low indicate that the model has not learned to translate despite the training.

I would appreciate it if you could help me solve this problem! The source code is available on GitHub: b-turan/transformer_pytorch, "Implementation of Transformer for Neural Machine Translation in PyTorch and Hugging Face (Work in Progress)".