Hi, I have trained a custom version of the T5 transformer for a text2text task. For this, I had to define two tokenizers, tokenizer_src and tokenizer_tgt, since different tokenization strategies were required for the inputs and the outputs, and as a consequence two different vocabularies are used. After training the model, I’m trying to run evaluation by generating output strings from the test inputs. Below is the code I’m using:
import torch
import torch.nn as nn
import json
import pandas as pd
from types import SimpleNamespace
from tokenizers import Tokenizer
from config import get_config
from transformers import T5ForConditionalGeneration, T5Tokenizer, T5Config
def evaluate(config,
             ckpt_path='ckpts/model_ckpt.pt',
             test_path='data/test.csv',
             tokenizer_src_path='tokenizer_src.json',
             tokenizer_tgt_path='tokenizer_tgt.json',
             max_length=512,
             num_beams=4):
    t5_config = T5Config(
        num_layers=config['num_layers'],
        num_decoder_layers=config['num_decoder_layers'],
        d_model=config['d_model'],
        d_ff=config['d_ff'],
        num_heads=config['num_heads'],
    )

    # Load tokenizers
    tokenizer_src = Tokenizer.from_file(tokenizer_src_path)
    tokenizer_tgt = Tokenizer.from_file(tokenizer_tgt_path)
    vocab_src_len = len(tokenizer_src.get_vocab())
    vocab_tgt_len = len(tokenizer_tgt.get_vocab())

    model = T5ForConditionalGeneration(t5_config)

    # Adjust the embedding layers to match the tokenizers' vocab sizes
    model.get_encoder().embed_tokens = nn.Embedding(vocab_src_len, t5_config.d_model)
    model.get_decoder().embed_tokens = nn.Embedding(vocab_tgt_len, t5_config.d_model)
    # Replace the lm_head with a new linear layer. It maps the decoder's output
    # embeddings to the target vocabulary space, which is necessary for generating predictions.
    model.lm_head = nn.Linear(t5_config.d_model, vocab_tgt_len, bias=False)

    ckpt = torch.load(ckpt_path, map_location='cpu')
    model.load_state_dict(ckpt['model_state_dict'])

    df_test = pd.read_csv(test_path)
    df_test = df_test[['inputs', config['out_name']]]
    pred_spec = []
    gt_spec = df_test[config['out_name']].tolist()
    input_mols = df_test['inputs'].tolist()

    # We need to define a generation config here!!
    for input_text in input_mols:
        input_ids = tokenizer_src.encode(input_text).ids
        input_ids = torch.tensor([input_ids]).to(model.device)
        with torch.no_grad():
            outputs = model.generate(input_ids,
                                     max_length=50,        # Maximum length of the generated text
                                     min_length=10,        # Minimum length of the generated text
                                     length_penalty=2.0,   # Penalise long sentences
                                     num_beams=4,          # Use beam search for better results
                                     early_stopping=True)  # Stop once enough complete beams are found
        pred_text = tokenizer_tgt.decode(outputs[0].tolist(), skip_special_tokens=True)
        pred_spec.append(pred_text)


if __name__ == '__main__':
    config = get_config()
    evaluate(config)
I’ve also modified the original T5 model by reducing the complexity of its layers (so that it fits on my GPU). When I run this, I get the following error:
"ValueError: decoder_start_token_id or bos_token_id has to be defined for encoder-decoder generation."
I’ve tried to set the decoder_start_token_id manually by doing:
model.config.decoder_start_token_id = tokenizer_tgt.token_to_id(bos_token)
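For context, this assignment sits right after the load_state_dict call in the script above, and bos_token is meant to be the start-of-sequence token string from my target tokenizer (it isn’t defined anywhere in the snippet, so treat it as a placeholder). One thing I’m not even sure about is whether that token exists in the target vocabulary at all; as far as I understand, Tokenizer.token_to_id returns None when the string isn’t in the vocab, which would leave decoder_start_token_id at None anyway. Roughly:

# Sketch of what I tried; '[SOS]' is a placeholder for whatever
# start-of-sequence string tokenizer_tgt actually defines.
bos_token = '[SOS]'
start_id = tokenizer_tgt.token_to_id(bos_token)
print(start_id)  # if this prints None, the token string is not in the target vocab
model.config.decoder_start_token_id = start_id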
Actually, I have noticed that if I don’t pass a generation_config argument to the generate() method, decoder_start_token_id is simply initialised to None, which causes the error. I’m just not sure how I should proceed.
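What I’m considering now is building an explicit GenerationConfig from the target tokenizer and passing it to generate() inside the loop, something along these lines (the '[SOS]', '[EOS]' and '[PAD]' strings are placeholders for whatever special tokens tokenizer_tgt was actually trained with, and I’m not certain this is the intended way to handle two separate vocabularies):

from transformers import GenerationConfig

# Placeholder token strings; they must match the special tokens in tokenizer_tgt's vocab.
gen_cfg = GenerationConfig(
    decoder_start_token_id=tokenizer_tgt.token_to_id('[SOS]'),
    eos_token_id=tokenizer_tgt.token_to_id('[EOS]'),
    pad_token_id=tokenizer_tgt.token_to_id('[PAD]'),
    max_length=50,
    min_length=10,
    length_penalty=2.0,
    num_beams=4,
    early_stopping=True,
)

with torch.no_grad():
    outputs = model.generate(input_ids, generation_config=gen_cfg)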
I’m afraid I might have missed something, as my task requires quite a bit of customization (custom tokenizers, different sizes for the input and output embeddings, etc.). I’d appreciate any suggestions.