I just tried to train an EncoderDecoder model for a summarization task, warm-started from pre-trained BanglaBERT, which is an ELECTRA discriminator model pre-trained with the Replaced Token Detection (RTD) objective. Surprisingly, after 4500 steps on 10k training examples, the model had not learned anything: the ROUGE-2 score was 0.0000. To make sure, I used the step-4500 checkpoint to generate summaries for a few test inputs. Regardless of the input length, it always produced a fixed-length output of 50 tokens: the [CLS] token repeated 49 times, followed by a single [SEP] token. I basically followed the "Warm-starting encoder-decoder models with Transformers" notebook. Can anybody give me a clue as to what the issue could be? Thanks in advance.
Here is my setup.
Tokenizer for the BanglaBERT model:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")
tokenizer.bos_token = tokenizer.cls_token
tokenizer.eos_token = tokenizer.sep_token
Input pre-processing function:
def process_data_to_model_inputs(batch):
    # Tokenize source texts and target summaries to fixed lengths
    inputs = tokenizer(batch['text'], padding="max_length", truncation=True, max_length=encoder_max_length)
    outputs = tokenizer(batch['summary'], padding="max_length", truncation=True, max_length=decoder_max_length)
    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask
    batch["decoder_input_ids"] = outputs.input_ids
    batch["decoder_attention_mask"] = outputs.attention_mask
    batch["labels"] = outputs.input_ids.copy()
    # Replace padding ids in the labels with -100 so the loss ignores them
    batch["labels"] = [[-100 if token == tokenizer.pad_token_id else token for token in labels] for labels in batch["labels"]]
    return batch
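To make the label handling above concrete, here is a toy illustration in plain Python (no tokenizer involved). The token ids are made up, `PAD_ID` and `START_ID` are hypothetical values, and `shift_right` is only my rough sketch of the usual teacher-forcing shift, not the library's own function. I include the shift because I am not sure whether passing `decoder_input_ids` identical to the labels, as I do above, is what the current EncoderDecoderModel expects.

```python
PAD_ID = 0            # hypothetical pad token id, for illustration only
START_ID = 2          # hypothetical decoder start ([CLS]) id
IGNORE_INDEX = -100   # default ignore_index of PyTorch's cross-entropy loss

def mask_pad_labels(label_ids, pad_id=PAD_ID):
    """Replace padding ids with IGNORE_INDEX, like the last step of my function."""
    return [[IGNORE_INDEX if tok == pad_id else tok for tok in seq] for seq in label_ids]

def shift_right(label_ids, start_id=START_ID, pad_id=PAD_ID):
    """Toy right shift: prepend start_id, drop the last position,
    and map IGNORE_INDEX back to pad_id."""
    shifted = []
    for seq in label_ids:
        row = [start_id] + list(seq[:-1])
        shifted.append([pad_id if tok == IGNORE_INDEX else tok for tok in row])
    return shifted

# One made-up padded summary: [CLS]=2, two word pieces, [SEP]=3, two pads
summary_ids = [[2, 15, 27, 3, 0, 0]]
labels = mask_pad_labels(summary_ids)
decoder_inputs = shift_right(labels)

print(labels)          # what I pass as "labels"
print(decoder_inputs)  # what a shifted decoder input would look like
```

In my pre-processing function, `decoder_input_ids` is the unshifted `summary_ids`, so at every position the decoder input and the label are the same token. Whether that matters for this model is exactly what I am unsure about.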
Mapping the pre-processing function to the batches of examples:
train_data = train_data.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=batch_size,
    remove_columns=["text", "summary"]
)
train_data.set_format(
    type="torch", columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
)
valid_data = valid_data.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=batch_size,
    remove_columns=["text", "summary"]
)
valid_data.set_format(
    type="torch", columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
)
BanglaBERT model and its config settings:
from transformers import EncoderDecoderModel
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained("csebuetnlp/banglabert", "csebuetnlp/banglabert")
bert2bert.config.decoder_start_token_id = tokenizer.bos_token_id
bert2bert.config.eos_token_id = tokenizer.eos_token_id
bert2bert.config.pad_token_id = tokenizer.pad_token_id
bert2bert.config.vocab_size = bert2bert.config.decoder.vocab_size
bert2bert.config.max_length = 128
bert2bert.config.min_length = 42
bert2bert.config.early_stopping = True
bert2bert.config.length_penalty = 2.0
bert2bert.config.num_beams = 8
bert2bert.config.remove_invalid_values = True
bert2bert.config.repetition_penalty = 2.0
I used the Seq2SeqTrainer for training. The Seq2SeqTrainingArguments were as follows:
evaluation_strategy = "steps",
per_device_train_batch_size = batch_size,
per_device_eval_batch_size = batch_size,
predict_with_generate = True,
logging_steps = 1000,
save_steps = 500,
eval_steps = 5000,
warmup_steps = 500,
overwrite_output_dir = True,
save_total_limit = 2,
num_train_epochs = 20,
fp16 = True
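For context on how far training got relative to these arguments, here is a small back-of-the-envelope sketch. It assumes a single device and no gradient accumulation. The batch size of 16 is a hypothetical value used only for illustration (I have not stated my actual `batch_size` above), and `schedule_summary` is a helper made up for this post, not part of Transformers.

```python
import math

def schedule_summary(num_examples, batch_size, num_epochs, steps_done,
                     save_steps, eval_steps, logging_steps):
    """Rough step arithmetic for a step-based training schedule."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * num_epochs,
        "epochs_done": steps_done / steps_per_epoch,
        # Number of times each interval has been reached so far
        "checkpoints_saved": steps_done // save_steps,
        "evaluations_run": steps_done // eval_steps,
        "log_entries": steps_done // logging_steps,
    }

summary = schedule_summary(
    num_examples=10_000,
    batch_size=16,        # hypothetical, for illustration only
    num_epochs=20,
    steps_done=4500,
    save_steps=500,
    eval_steps=5000,
    logging_steps=1000,
)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Under these assumptions, 4500 steps is only part of the planned 20 epochs, and with eval_steps = 5000 the first scheduled evaluation would not run until step 5000.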