BART - Input format

Hi,
Due to recent code changes by @sshleifer, I am trying to understand what the desired format for BART's input is for training and generation, and whether the codebase reflects it properly, as I've encountered some inconsistencies.

I am assuming both src_ids and tgt_ids are encoded with a BART tokenizer, and therefore have the format of [bos, token1, token2, …, eos].
Looking at transformers/examples/seq2seq/finetune.py#L151, decoder_input_ids = shift_tokens_right(tgt_ids) means that eos will be the first token of decoder_input_ids and bos the second.
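
To make the shift concrete, here is a minimal sketch of what I understand is happening (my paraphrase of the rotate-right behaviour in examples/seq2seq, not the exact library code; the example sentence is just a placeholder):

```python
import torch
from transformers import BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-large")

# tgt_ids as produced by the tokenizer: [bos, token1, ..., eos]
tgt_ids = torch.tensor([tok("a short target sentence").input_ids])
print(tok.convert_ids_to_tokens(tgt_ids[0].tolist()))  # ['<s>', ..., '</s>']


def shift_tokens_right(input_ids, pad_token_id):
    """Rotate right: the last non-pad token (eos) is copied to position 0,
    everything else shifts right by one."""
    prev_output_tokens = input_ids.clone()
    index_of_eos = (input_ids.ne(pad_token_id).sum(dim=1) - 1).unsqueeze(-1)
    prev_output_tokens[:, 0] = input_ids.gather(1, index_of_eos).squeeze()
    prev_output_tokens[:, 1:] = input_ids[:, :-1]
    return prev_output_tokens


decoder_input_ids = shift_tokens_right(tgt_ids, tok.pad_token_id)
print(tok.convert_ids_to_tokens(decoder_input_ids[0].tolist()))  # ['</s>', '<s>', ...]
```
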
This has an effect on generation:

  1. We need decoder_start_token_id=eos_token_id.
  2. The first actually generated token (i.e. after decoder_start_token_id) will be bos.
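
A quick sanity check along these lines (bart-large-cnn is used only as an example checkpoint; any seq2seq BART checkpoint should behave similarly):

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

inputs = tok("Some source document to summarize.", return_tensors="pt")

# Start decoding from eos; the first token actually predicted should then be bos.
out = model.generate(
    **inputs,
    decoder_start_token_id=tok.eos_token_id,
    max_length=20,
)
print(tok.convert_ids_to_tokens(out[0].tolist())[:3])  # e.g. ['</s>', '<s>', ...]
```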

Questions:

  1. The default value for decoder_start_token_id is missing from facebook/bart-base and facebook/bart-large-mnli, which means it falls back to bos. The other BART models have eos as their decoder_start_token_id. Why the difference? It looks to me like using finetune.py with bart-base/bart-large-mnli will not generate as intended (see the config-check sketch after these questions).

  2. In fairseq’s implementation, the equivalent of decoder_start_token_id is set to bos: fairseq/models/bart/hub_interface.py#L123. Can you please explain why you decided to use the format [eos, bos, token1, token2, ...] for decoder_input_ids instead of [bos, token1, token2, ...]?

  3. Is there still a need for force_bos_token_to_be_generated?
    It was introduced in transformers/pull/6526 (new user, can’t add another link), when the first token of decoder_input_ids was bos and the second was the first regular token of the target sequence (transformers/examples/seq2seq/finetune.py#L144). Using it now shouldn’t have any effect, if I understand correctly (because a trained model will easily learn to always output bos in this position anyway).
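
Regarding question 1, this is the config check I have been running (the printed values depend on the hosted configs, so they may change over time):

```python
from transformers import AutoConfig

checkpoints = [
    "facebook/bart-base",
    "facebook/bart-large",
    "facebook/bart-large-mnli",
    "facebook/bart-large-cnn",
    "facebook/bart-large-xsum",
]

for name in checkpoints:
    cfg = AutoConfig.from_pretrained(name)
    # If decoder_start_token_id is None, generate() falls back to bos_token_id.
    print(name, cfg.decoder_start_token_id, cfg.bos_token_id, cfg.eos_token_id)
```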

Thanks!

  • This is a great question and super confusing.
  • You can get a good snapshot of my understanding here. I don’t think that will clear everything up, but if you could read that and let me know what you still don’t understand, it would be helpful.
  • If I encounter empirical evidence that I should change decoder_start_token_id for any model I will do so.

I hope you folks don’t mind me reviving this old topic, but it seems relevant to my issues with BOS and EOS.

I am running different training experiments with different input formats for a facebook/bart-base model augmented with unlimiformer (retrieval-augmented so it can handle larger documents).

The standard format is a Long Document (LD): a single string.
The second format is a Knowledge Graph of the LD: a single string, but with each relation in the graph entered as a sequence "<s> rel1 </s><s> rel2 </s> ... <s> reln </s>"
The target is a summary: a single string.
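
For reference, this is roughly how I expect the KG string to be encoded (the relation strings below are placeholders). As far as I can tell, the literal <s>/</s> markers in the text are matched as the special tokens themselves, on top of the pair the tokenizer adds when add_special_tokens=True, but it is worth printing this in your own environment:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/bart-base")

# Placeholder relations; in my setup each "rel" is a flattened (subject, predicate, object).
kg = "<s> subj1 pred1 obj1 </s><s> subj2 pred2 obj2 </s>"

enc = tok(kg, add_special_tokens=True)
print(tok.convert_ids_to_tokens(enc.input_ids))
# The embedded <s>/</s> markers appear as bos/eos ids themselves,
# in addition to the outer <s> ... </s> pair added by the tokenizer.
```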

The odd thing is that in my initial conda environment, following the version constraints suggested by the unlimiformer authors, I ended up with transformers 4.34.1. That environment somehow got corrupted, so I built a new one; since the constraints seemed unsatisfiable, I dropped them and ended up with transformers 4.35.2.

The difference was stark.
With 4.34.1, LDs result in summaries of length 70-130, whereas KGs result in summaries of length 600-1000, with lots of lists. Moreover, although the KG inputs are about one-fifth the length, their summaries took much longer to generate.
With 4.35.2, both formats produce summaries of very similar length, and training is twice as fast (12 hours to reach 21k steps).

Also, I am not sure what role the max_length=128 and generation_max_length=1024 parameters play here, as the results seem to violate the max_length constraint.
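
My working assumption (not yet verified against the unlimiformer training script, so treat the parameter placement below as a guess) is that the two settings govern different things: max_length=128 truncates the encoded inputs/targets, while generation_max_length=1024 is what Seq2SeqTrainer passes to generate() during evaluation, which would explain why summaries of 600-1000 tokens do not violate the 128 limit:

```python
from transformers import AutoTokenizer, Seq2SeqTrainingArguments

tok = AutoTokenizer.from_pretrained("facebook/bart-base")

# max_length here only truncates the encoded text; it does not bound generation.
batch = tok(["a long document ..."], max_length=128, truncation=True)

# generation_max_length bounds the sequences produced by generate() at eval/predict time.
args = Seq2SeqTrainingArguments(
    output_dir="out",
    predict_with_generate=True,
    generation_max_length=1024,
)
```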

If you feel this is unrelated, I’m happy to open a new post.

I found that the add_special_tokens=False setting seems to be driving the above behaviour, and that this depends on the conda environment (for a fixed transformers version 4.34.1).
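
A quick diagnostic I have been using to see which behaviour a given environment actually gives (print the installed version plus the ids under both settings, then compare across the two environments; the sample string is a placeholder):

```python
import transformers
from transformers import AutoTokenizer

print(transformers.__version__)

tok = AutoTokenizer.from_pretrained("facebook/bart-base")
sample = "<s> subj1 pred1 obj1 </s>"  # placeholder relation string

for add_special in (True, False):
    ids = tok(sample, add_special_tokens=add_special).input_ids
    # With add_special_tokens=False the outer <s>/</s> pair is not added,
    # so the sequence starts and ends with whatever markers are in the text itself.
    print(add_special, tok.convert_ids_to_tokens(ids))
```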

Hey, has anyone got any answers to the above? Thanks!