Training issue with a Transformer-based Encoder-Decoder model built on pre-trained BanglaBERT

I just tried to train an EncoderDecoder model for a summarization task based on pre-trained BanglaBERT, which is an ELECTRA discriminator model pre-trained with the Replaced Token Detection (RTD) objective. Surprisingly, after 4500 steps on 10k training examples, the model had not learned anything: the ROUGE-2 score was still 0.0000. To verify, I used that 4500-step checkpoint to generate summaries on test inputs; it always produced a fixed-length (50-token) output, regardless of the input length, consisting of the [CLS] token 49 times followed by a single [SEP] token. I basically followed the Warm-starting encoder-decoder models with :hugs: Transformers notebook. Can anybody give me a clue about what the issue could be here? Thanks in advance.
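
For reference, the generation check looked roughly like this (a minimal sketch; the checkpoint path and the 512 max input length are placeholders, not necessarily my exact values):

from transformers import AutoTokenizer, EncoderDecoderModel

tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")
model = EncoderDecoderModel.from_pretrained("./checkpoint-4500")  # hypothetical checkpoint directory

text = "..."  # any Bangla article from the test set
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
output_ids = model.generate(inputs.input_ids, attention_mask=inputs.attention_mask)
print(tokenizer.decode(output_ids[0], skip_special_tokens=False))
# for every input this prints [CLS] repeated 49 times followed by a single [SEP]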

In my case,

Tokenizer for the BanglaBERT model:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")
# BanglaBERT's ELECTRA tokenizer has no dedicated BOS/EOS tokens, so reuse [CLS]/[SEP]
tokenizer.bos_token = tokenizer.cls_token
tokenizer.eos_token = tokenizer.sep_token

Input pre-processing function:

def process_data_to_model_inputs(batch):
    # encoder_max_length / decoder_max_length are defined earlier in the script
    inputs = tokenizer(batch["text"], padding="max_length", truncation=True, max_length=encoder_max_length)
    outputs = tokenizer(batch["summary"], padding="max_length", truncation=True, max_length=decoder_max_length)

    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask
    batch["decoder_input_ids"] = outputs.input_ids
    batch["decoder_attention_mask"] = outputs.attention_mask
    batch["labels"] = outputs.input_ids.copy()

    # mask padding tokens in the labels with -100 so they are ignored by the loss
    batch["labels"] = [[-100 if token == tokenizer.pad_token_id else token for token in labels] for labels in batch["labels"]]
    return batch

Mapping the pre-processing function to the batches of examples:

train_data = train_data.map(
    process_data_to_model_inputs, 
    batched=True,
    batch_size=batch_size,
    remove_columns=["text", "summary"]
)
train_data.set_format(
    type="torch", columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
)

valid_data = valid_data.map(
    process_data_to_model_inputs, 
    batched=True, 
    batch_size=batch_size,
    remove_columns=["text", "summary"]
)
valid_data.set_format(
    type="torch", columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
)

BanglaBERT model and its config settings:

from transformers import EncoderDecoderModel

bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained("csebuetnlp/banglabert", "csebuetnlp/banglabert")

# special tokens used during generation
bert2bert.config.decoder_start_token_id = tokenizer.bos_token_id
bert2bert.config.eos_token_id = tokenizer.eos_token_id
bert2bert.config.pad_token_id = tokenizer.pad_token_id

# beam search / generation parameters
bert2bert.config.vocab_size = bert2bert.config.decoder.vocab_size
bert2bert.config.max_length = 128
bert2bert.config.min_length = 42
bert2bert.config.early_stopping = True
bert2bert.config.length_penalty = 2.0
bert2bert.config.num_beams = 8
bert2bert.config.remove_invalid_values = True
bert2bert.config.repetition_penalty = 2.0

I used the Seq2SeqTrainer for training. The Seq2SeqTrainingArguments were as follows (a sketch of the full trainer setup is shown after the argument list):

    evaluation_strategy = "steps",
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
    predict_with_generate = True,
    logging_steps = 1000, 
    save_steps = 500, 
    eval_steps = 5000, 
    warmup_steps = 500,
    overwrite_output_dir = True,
    save_total_limit = 2,
    num_train_epochs = 20,
    fp16 = True
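
For completeness, here is roughly how the trainer was wired up (a minimal sketch, not my exact script: the output_dir, loading ROUGE via the evaluate library, and the returned metric key are assumptions):

import evaluate
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

rouge = evaluate.load("rouge")  # assumption: ROUGE computed with the evaluate library

def compute_metrics(pred):
    # decode generated ids and gold label ids back to text
    pred_ids = pred.predictions
    label_ids = pred.label_ids
    label_ids[label_ids == -100] = tokenizer.pad_token_id  # undo the -100 masking

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    scores = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])
    return {"rouge2": round(scores["rouge2"], 4)}

training_args = Seq2SeqTrainingArguments(
    output_dir="./bert2bert-banglabert",  # placeholder path, not my actual value
    # ... plus the arguments listed above ...
)

trainer = Seq2SeqTrainer(
    model=bert2bert,
    tokenizer=tokenizer,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_data,
    eval_dataset=valid_data,
)
trainer.train()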

See also this related report: "encoder-decoder (bert2bert) model for summarization task doesn't work in v4.18" (huggingface/blog, GitHub issue #292).