BART - Input format

Hi,
Following the recent code changes by @sshleifer, I am trying to understand the desired input format for BART during training and generation, and whether the codebase reflects it properly, as I've encountered some inconsistencies.

I am assuming both src_ids and tgt_ids are encoded with a BART tokenizer, and therefore have the format of [bos, token1, token2, …, eos].
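As a quick sanity check of that assumption (the example sentence and model choice are arbitrary):

```python
from transformers import BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-large")
ids = tok.encode("Hello world")  # special tokens are added by default
# The encoded sequence should start with bos and end with eos
print(ids[0] == tok.bos_token_id, ids[-1] == tok.eos_token_id)  # True True
```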
Looking at transformers/examples/seq2seq/finetune.py#L151,
decoder_input_ids = shift_tokens_right(tgt_ids) means that eos will be the first token and bos the second token.
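A minimal sketch of that behaviour (not a verbatim copy of the helper used by finetune.py, just what I understand it to do):

```python
import torch

def shift_tokens_right_sketch(input_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    """Wrap the last non-pad token (eos) around to position 0 and shift everything else right."""
    shifted = input_ids.clone()
    index_of_eos = (input_ids.ne(pad_token_id).sum(dim=1) - 1).unsqueeze(-1)
    shifted[:, 0] = input_ids.gather(1, index_of_eos).squeeze(-1)
    shifted[:, 1:] = input_ids[:, :-1]
    return shifted

bos, pad, eos = 0, 1, 2  # BART's special token ids
tgt_ids = torch.tensor([[bos, 8, 9, 10, eos]])
print(shift_tokens_right_sketch(tgt_ids, pad))  # tensor([[2, 0, 8, 9, 10]]) i.e. [eos, bos, ...]
```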
This has an effect on generation:

  1. We need decoder_start_token_id=eos_token_id.
  2. The first actually generated token (i.e. after decoder_start_token_id) will be bos.
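In other words, I would expect generation to look roughly like this (a sketch; the model name, input text and max_length are just placeholders):

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

batch = tok(["Some long article to summarize ..."], return_tensors="pt")
# Prime the decoder with eos; the first token the model itself predicts
# should then be bos, matching the training-time decoder_input_ids format.
out = model.generate(**batch, decoder_start_token_id=tok.eos_token_id, max_length=20)
print(out[0][:3])  # expected to start with [eos_token_id, bos_token_id, ...]
```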

Questions:

  1. The default value for decoder_start_token_id is missing from facebook/bart-base and facebook/bart-large-mnli, which means it falls back to bos, while the other BART models have eos as their decoder_start_token_id. Why the difference? It looks to me like using finetune.py with bart-base/bart-large-mnli will not generate as intended (see the quick config check after this list).

  2. In fairseq’s implementation, the equivalent of decoder_start_token_id is set to bos: fairseq/models/bart/hub_interface.py#L123. Can you please explain why you decided to use the format [eos, bos, token1, token2, ...] for decoder_input_ids instead of [bos, token1, token2, ...]?

  3. Is there still a need for force_bos_token_to_be_generated? (See the snippet after this list.)
    It was introduced in transformers/pull/6526 (new user, can’t add another link), back when the first token of decoder_input_ids was bos and the second was the first regular token of the target sequence (transformers/examples/seq2seq/finetune.py#L144). Using it now shouldn’t have any effect, if I understand correctly, because a trained model will easily learn to always output bos in that position anyway.
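For reference, here is the quick config check I mean for question 1 (assuming the hub configs as they are today; when decoder_start_token_id is None, generate() falls back to bos):

```python
from transformers import AutoConfig

for name in ["facebook/bart-base", "facebook/bart-large-mnli",
             "facebook/bart-large", "facebook/bart-large-cnn"]:
    cfg = AutoConfig.from_pretrained(name)
    print(name, cfg.decoder_start_token_id, cfg.bos_token_id, cfg.eos_token_id)
```

And for question 3, the usage I have in mind is simply this (reusing the objects from the generation sketch above; as far as I can tell from PR 6526, the flag forces bos at the step right after decoder_start_token_id during generation):

```python
model.config.force_bos_token_to_be_generated = True
out = model.generate(**batch, decoder_start_token_id=tok.eos_token_id, max_length=20)
```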

Thanks!

  • This is a great question and super confusing.
  • You can get a good snapshot of my understanding here. I don’t think that will clear everything up, but if you could read that and let me know what you still don’t understand it would be helpful.
  • If I encounter empirical evidence that I should change decoder_start_token_id for any model I will do so.