Problem generating with T5ForConditionalGeneration on a custom task

Hi, I have trained a custom version of the T5 transformer for a text2text task. For this, I had to define two tokenizers, tokenizer_src and tokenizer_tgt, since the inputs and outputs require different tokenization strategies and consequently use different vocabularies. After training the model, I’m trying to perform evaluation by generating output strings from inputs. Below is the code I’m using:

import torch
import torch.nn as nn
import pandas as pd
from tokenizers import Tokenizer
from config import get_config
from transformers import T5ForConditionalGeneration, T5Config

def evaluate(config,
             ckpt_path='ckpts/model_ckpt.pt',
             test_path='data/test.csv',
             tokenizer_src_path = 'tokenizer_src.json', 
             tokenizer_tgt_path = 'tokenizer_tgt.json',
             max_length = 512,
             num_beams = 4):

    t5_config = T5Config(
        num_layers=config['num_layers'], 
        num_decoder_layers=config['num_decoder_layers'],
        d_model=config['d_model'],
        d_ff=config['d_ff'],  
        num_heads=config['num_heads'],
    )
    
    # Load the source and target tokenizers
    tokenizer_src = Tokenizer.from_file(tokenizer_src_path)
    tokenizer_tgt = Tokenizer.from_file(tokenizer_tgt_path)

    vocab_src_len = len(tokenizer_src.get_vocab())
    vocab_tgt_len = len(tokenizer_tgt.get_vocab())
    model = T5ForConditionalGeneration(t5_config)

    # Adjust the embedding layers to match each tokenizer's vocab size
    model.get_encoder().embed_tokens = nn.Embedding(vocab_src_len, t5_config.d_model)
    model.get_decoder().embed_tokens = nn.Embedding(vocab_tgt_len, t5_config.d_model)

    # Replace the lm_head with a new linear layer. It maps the decoder's output
    # embeddings to the target vocabulary, which is necessary for generating predictions.
    model.lm_head = nn.Linear(t5_config.d_model, vocab_tgt_len, bias=False)
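
    # NOTE (an assumption worth verifying): T5Config defaults to tie_word_embeddings=True,
    # which ties lm_head to the shared input embeddings. With different source and target
    # vocabularies that tying no longer holds, so passing tie_word_embeddings=False when
    # building t5_config above is probably safer; check against how the model was trained.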

    ckpt = torch.load(ckpt_path, map_location='cpu')
    model.load_state_dict(ckpt['model_state_dict'])

    df_test = pd.read_csv(test_path)
    df_test = df_test[['inputs',config['out_name']]]

    pred_spec  = []
    gt_spec    = df_test[config['out_name']].tolist()
    input_mols = df_test['inputs'].tolist()

    # We need to define a generation config here!!
    for input_text in input_mols:
        input_ids = tokenizer_src.encode(input_text).ids
        input_ids = torch.tensor([input_ids]).to(model.device)

        with torch.no_grad():
            outputs = model.generate(input_ids,
                                     max_length=50,       # Maximum length of the generated text
                                     min_length=10,       # Minimum length of the generated text
                                     length_penalty=2.0,  # Penalise long outputs
                                     num_beams=4,         # Beam search for better results
                                     early_stopping=True) # Stop once enough beams have finished
    
        pred_text = tokenizer_tgt.decode(outputs[0].tolist(), skip_special_tokens=True)
        pred_spec.append(pred_text)

    return pred_spec, gt_spec

if __name__ == '__main__':
    config = get_config()
    evaluate(config)

I’ve also modified the original T5 architecture by reducing the size of its layers so that it fits on my GPU. What I get is the following error:

"ValueError: decoder_start_token_id or bos_token_id has to be defined for encoder-decoder generation."

I’ve tried to set the decoder_start_token_id manually by doing:

model.config.decoder_start_token_id = tokenizer_tgt.token_to_id(bos_token)  # bos_token is the BOS string in my target vocabulary

Actually, I have noticed that when no generation_config argument is passed to the generate() method, decoder_start_token_id is simply initialized to None, which causes the error. One option might be to set the relevant IDs on the model’s generation config directly (a sketch; [BOS], [EOS] and [PAD] are placeholders for whatever special tokens my target tokenizer actually defines):
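
# '[BOS]'/'[EOS]'/'[PAD]' are placeholders for the target tokenizer's actual special tokens
model.generation_config.decoder_start_token_id = tokenizer_tgt.token_to_id('[BOS]')
model.generation_config.eos_token_id = tokenizer_tgt.token_to_id('[EOS]')
model.generation_config.pad_token_id = tokenizer_tgt.token_to_id('[PAD]')

I’m just not sure whether this is the right way to proceed.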

I’m afraid I might have missed out something as my task requires some levels of customizations (custom tokenizers, different sizes for input and output embeddings etc.). I appreciate any suggestions.

There seems to have been a problem with T5’s generation output in a newer version of Transformers a while ago, so if your issue is related to that, it should be an easy fix… 🤔

Thanks! I’m going to try manually specifying the GenerationConfig and will update the topic soon!
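
Something along these lines (just a sketch; [BOS], [EOS] and [PAD] are placeholders for the actual special tokens in my target vocabulary):

from transformers import GenerationConfig

# Explicit generation settings so generate() knows how to start, pad and stop decoding.
# Note: '[BOS]', '[EOS]', '[PAD]' stand in for the target tokenizer's real special tokens.
gen_config = GenerationConfig(
    decoder_start_token_id=tokenizer_tgt.token_to_id('[BOS]'),
    eos_token_id=tokenizer_tgt.token_to_id('[EOS]'),
    pad_token_id=tokenizer_tgt.token_to_id('[PAD]'),
    max_length=50,
    min_length=10,
    num_beams=4,
    length_penalty=2.0,
    early_stopping=True,
)

outputs = model.generate(input_ids, generation_config=gen_config)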
