Hi, I have trained a custom version of the T5 transformer for a text2text task. For this, I had to define two tokenizers, tokenizer_src and tokenizer_tgt, since different tokenization strategies were required for the inputs and the outputs, and as a consequence two different vocabularies are used. After training the model, I’m trying to run evaluation by generating output strings from the test inputs. Below is the code I’m using:
import torch
import torch.nn as nn
import json
import pandas as pd
from types import SimpleNamespace
from tokenizers import Tokenizer
from config import get_config
from transformers import T5ForConditionalGeneration, T5Tokenizer, T5Config
def evaluate(config,
             ckpt_path='ckpts/model_ckpt.pt',
             test_path='data/test.csv',
             tokenizer_src_path='tokenizer_src.json',
             tokenizer_tgt_path='tokenizer_tgt.json',
             max_length=512,
             num_beams=4):
    t5_config = T5Config(
        num_layers=config['num_layers'],
        num_decoder_layers=config['num_decoder_layers'],
        d_model=config['d_model'],
        d_ff=config['d_ff'],
        num_heads=config['num_heads'],
    )

    # Load tokenizers
    tokenizer_src = Tokenizer.from_file(tokenizer_src_path)
    tokenizer_tgt = Tokenizer.from_file(tokenizer_tgt_path)
    vocab_src_len = len(tokenizer_src.get_vocab())
    vocab_tgt_len = len(tokenizer_tgt.get_vocab())

    model = T5ForConditionalGeneration(t5_config)

    # Adjust the embedding layers to match the tokenizers' vocab sizes
    model.get_encoder().embed_tokens = nn.Embedding(vocab_src_len, t5_config.d_model)
    model.get_decoder().embed_tokens = nn.Embedding(vocab_tgt_len, t5_config.d_model)
    # Replace the lm_head with a new linear layer. It maps the decoder's output
    # embeddings to the target vocabulary space, which is necessary for generating predictions.
    model.lm_head = nn.Linear(t5_config.d_model, vocab_tgt_len, bias=False)

    ckpt = torch.load(ckpt_path, map_location='cpu')
    model.load_state_dict(ckpt['model_state_dict'])

    df_test = pd.read_csv(test_path)
    df_test = df_test[['inputs', config['out_name']]]
    pred_spec = []
    gt_spec = df_test[config['out_name']].tolist()
    input_mols = df_test['inputs'].tolist()

    # We need to define a generation config here!!
    for input_text in input_mols:
        input_ids = tokenizer_src.encode(input_text).ids
        input_ids = torch.tensor([input_ids]).to(model.device)
        with torch.no_grad():
            outputs = model.generate(input_ids,
                                     max_length=50,        # Maximum length of the generated text
                                     min_length=10,        # Minimum length of the generated text
                                     length_penalty=2.0,   # Penalise long sentences
                                     num_beams=4,          # Use beam search for better results
                                     early_stopping=True)  # Stop once enough complete beams are found
        pred_text = tokenizer_tgt.decode(outputs[0].tolist(), skip_special_tokens=True)
        pred_spec.append(pred_text)


if __name__ == '__main__':
    config = get_config()
    evaluate(config)
I’ve also modified the original T5 model by reducing the complexity of its layers (so that it fits on my GPU). When I run this, I get the following error:
"ValueError: decoder_start_token_id or bos_token_id has to be defined for encoder-decoder generation."
I’ve tried to set the decoder_start_token_id manually by doing:
model.config.decoder_start_token_id = tokenizer_tgt.token_to_id(bos_token)
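For context, this assignment sits right after the load_state_dict call in the script above, and bos_token is meant to be the start-of-sequence token string from my target tokenizer (it isn’t defined anywhere in the snippet, so treat it as a placeholder). One thing I’m not even sure about is whether that token exists in the target vocabulary at all; as far as I understand, Tokenizer.token_to_id returns None when the string isn’t in the vocab, which would leave decoder_start_token_id at None anyway. Roughly:

# Sketch of what I tried; '[SOS]' is a placeholder for whatever
# start-of-sequence string tokenizer_tgt actually defines.
bos_token = '[SOS]'
start_id = tokenizer_tgt.token_to_id(bos_token)
print(start_id)  # if this prints None, the token string is not in the target vocab
model.config.decoder_start_token_id = start_id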
Actually, I have noticed that if I don’t pass a generation_config argument to the generate() method, decoder_start_token_id is simply initialised to None, which causes the error. I’m just not sure how I should proceed.
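What I’m considering now is building an explicit GenerationConfig from the target tokenizer and passing it to generate() inside the loop, something along these lines (the '[SOS]', '[EOS]' and '[PAD]' strings are placeholders for whatever special tokens tokenizer_tgt was actually trained with, and I’m not certain this is the intended way to handle two separate vocabularies):

from transformers import GenerationConfig

# Placeholder token strings; they must match the special tokens in tokenizer_tgt's vocab.
gen_cfg = GenerationConfig(
    decoder_start_token_id=tokenizer_tgt.token_to_id('[SOS]'),
    eos_token_id=tokenizer_tgt.token_to_id('[EOS]'),
    pad_token_id=tokenizer_tgt.token_to_id('[PAD]'),
    max_length=50,
    min_length=10,
    length_penalty=2.0,
    num_beams=4,
    early_stopping=True,
)

with torch.no_grad():
    outputs = model.generate(input_ids, generation_config=gen_cfg)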
I’m afraid I might have missed something, as my task requires quite a bit of customization (custom tokenizers, different sizes for the input and output embeddings, etc.). I’d appreciate any suggestions.