I am new to the Hugging Face documentation, so apologies if this is a silly question. I am training a BartDecoder from scratch, paired with my own custom encoder instead of BartEncoder. To the BartDecoder, I pass in:
inputs_embeds: embeddings of the target token sequence
encoder_hidden_states: last hidden state of my custom encoder
attention_mask: a mask over the target token sequence that marks the pad tokens
I have also added a head to the decoder so it outputs logits of shape (batch_size, seq_len, 50265), where the last dimension is the BART vocab size. From there, I use nn.CrossEntropyLoss(reduction='none') to compare the logits to the true class values. For every prediction, the loss comes out as exactly 0. I have checked, and the output logits always predict the correct word! I am not using a pretrained decoder, nor have I run a single learning step!
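For reference, the loss computation described above can be sketched as follows. This is only an illustration under stated assumptions: random logits stand in for the decoder head's output, the pad token id is assumed to be 1, and the labels are assumed to follow the usual teacher-forcing convention (the target sequence shifted one position left relative to the decoder inputs). None of the variable names come from my actual code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, seq_len, vocab_size = 2, 6, 50265
pad_id = 1  # assumed pad token id, used here only for illustration

# Stand-in target token ids; the second sequence ends in two pad tokens.
targets = torch.randint(2, vocab_size, (batch_size, seq_len))
targets[1, -2:] = pad_id

# Teacher forcing: the decoder sees tokens 0..n-2 and is scored on tokens 1..n-1.
decoder_input_ids = targets[:, :-1]
labels = targets[:, 1:]

# Random logits stand in for the output of the decoder head.
logits = torch.randn(batch_size, seq_len - 1, vocab_size)

loss_fn = nn.CrossEntropyLoss(reduction="none")
# CrossEntropyLoss expects (N, C) logits and (N,) class indices, so flatten first.
per_token_loss = loss_fn(logits.reshape(-1, vocab_size), labels.reshape(-1))
per_token_loss = per_token_loss.view(batch_size, seq_len - 1)

# Average only over non-pad positions.
not_pad = labels != pad_id
mean_loss = (per_token_loss * not_pad).sum() / not_pad.sum()
print(per_token_loss.shape, mean_loss.item())
```

With an untrained model, a per-token loss in the neighbourhood of ln(50265) ≈ 10.8 would be the expected starting point, which is why a loss of exactly 0 looks wrong to me.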
I believe I may have a misunderstanding about the attention masks. From the internal documentation, it seems BartDecoder has a ._prepare_decoder_attention_mask() method, which I think should handle masking out future context for each prediction during a training step, but I am not sure. Does anyone have a solution to this issue?
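To make clear what I mean by "masking out future context", here is a minimal sketch of the kind of mask I would expect to be applied. It is my own illustration of the concept, not the library's implementation: build_causal_mask is a hypothetical helper, and it assumes attention_mask uses 1 for real tokens and 0 for padding.

```python
import torch

def build_causal_mask(attention_mask: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: combine a causal mask with a padding mask.

    attention_mask: (batch, tgt_len), 1 for real tokens and 0 for padding.
    Returns a bool tensor of shape (batch, tgt_len, tgt_len) where entry
    [b, i, j] is True if query position i may attend to key position j.
    """
    batch, tgt_len = attention_mask.shape
    # Lower-triangular matrix: position i may only look at positions j <= i.
    causal = torch.tril(torch.ones(tgt_len, tgt_len, dtype=torch.bool))
    # Padding keys are hidden from every query position.
    keep = attention_mask.bool()[:, None, :]  # (batch, 1, tgt_len)
    return causal[None, :, :] & keep

# One sequence of length 4 whose last token is padding.
attention_mask = torch.tensor([[1, 1, 1, 0]])
allowed = build_causal_mask(attention_mask)
print(allowed[0].int())
```

If a mask like this is applied, the prediction at position i cannot see the target token at position i+1, so the decoder should not be able to copy the answer from its own input.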