BART - Input format

Hi,
Due to recent code changes by @sshleifer, I am trying to understand what the desired format for BART's input is for training and generation, and whether the codebase reflects it properly, as I've encountered some inconsistencies.

I am assuming both src_ids and tgt_ids are encoded with a BART tokenizer, and therefore have the format of [bos, token1, token2, …, eos].
Looking at transformers/examples/seq2seq/finetune.py#L151, decoder_input_ids = shift_tokens_right(tgt_ids) means that eos will be the first token of decoder_input_ids and bos the second.
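
To make the shift concrete, here is a minimal sketch of what I understand is happening (my paraphrase of the rotate-right behaviour in examples/seq2seq, not the exact library code; the example sentence is just a placeholder):

```python
import torch
from transformers import BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-large")

# tgt_ids as produced by the tokenizer: [bos, token1, ..., eos]
tgt_ids = torch.tensor([tok("a short target sentence").input_ids])
print(tok.convert_ids_to_tokens(tgt_ids[0].tolist()))  # ['<s>', ..., '</s>']


def shift_tokens_right(input_ids, pad_token_id):
    """Rotate right: the last non-pad token (eos) is copied to position 0,
    everything else shifts right by one."""
    prev_output_tokens = input_ids.clone()
    index_of_eos = (input_ids.ne(pad_token_id).sum(dim=1) - 1).unsqueeze(-1)
    prev_output_tokens[:, 0] = input_ids.gather(1, index_of_eos).squeeze()
    prev_output_tokens[:, 1:] = input_ids[:, :-1]
    return prev_output_tokens


decoder_input_ids = shift_tokens_right(tgt_ids, tok.pad_token_id)
print(tok.convert_ids_to_tokens(decoder_input_ids[0].tolist()))  # ['</s>', '<s>', ...]
```
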
This has an effect on generation:

  1. We need decoder_start_token_id=eos_token_id.
  2. The first actually generated token (i.e. after decoder_start_token_id) will be bos.
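
A quick sanity check along these lines (bart-large-cnn is used only as an example checkpoint; any seq2seq BART checkpoint should behave similarly):

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

inputs = tok("Some source document to summarize.", return_tensors="pt")

# Start decoding from eos; the first token actually predicted should then be bos.
out = model.generate(
    **inputs,
    decoder_start_token_id=tok.eos_token_id,
    max_length=20,
)
print(tok.convert_ids_to_tokens(out[0].tolist())[:3])  # e.g. ['</s>', '<s>', ...]
```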

Questions:

  1. The default value for decoder_start_token_id is missing from facebook/bart-base and facebook/bart-large-mnli, which means it falls back to bos. The other BART models have eos as their decoder_start_token_id. Why the difference? It looks to me like using finetune.py with bart-base/bart-large-mnli will not generate as intended (see the config-check sketch after these questions).

  2. In fairseq’s implementation, the equivalent of decoder_start_token_id is set to bos: fairseq/models/bart/hub_interface.py#L123. Can you please explain why you decided to use the format [eos, bos, token1, token2, ...] for decoder_input_ids instead of [bos, token1, token2, ...]?

  3. Is there still a need for force_bos_token_to_be_generated?
    It was introduced in transformers/pull/6526 (new user, can’t add another link), when the first token of decoder_input_ids was bos and the second was the first regular token of the target sequence (transformers/examples/seq2seq/finetune.py#L144). Using it now shouldn’t have any effect, if I understand correctly (because a trained model will easily learn to always output bos in this position anyway).
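
Regarding question 1, this is the config check I have been running (the printed values depend on the hosted configs, so they may change over time):

```python
from transformers import AutoConfig

checkpoints = [
    "facebook/bart-base",
    "facebook/bart-large",
    "facebook/bart-large-mnli",
    "facebook/bart-large-cnn",
    "facebook/bart-large-xsum",
]

for name in checkpoints:
    cfg = AutoConfig.from_pretrained(name)
    # If decoder_start_token_id is None, generate() falls back to bos_token_id.
    print(name, cfg.decoder_start_token_id, cfg.bos_token_id, cfg.eos_token_id)
```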

Thanks!

  • This is a great question and super confusing.
  • You can get a good snapshot of my understanding here. I don’t think that will clear everything up, but if you could read that and let me know what you still don’t understand, it would be helpful.
  • If I encounter empirical evidence that I should change decoder_start_token_id for any model I will do so.

I hope you folks don’t mind me reviving this old topic, but it seems relevant to my issues with BOS and EOS.

I am running different training experiments with different input formats for a facebook/bart-base model augmented with unlimiformer (retrieval-augmented so it can handle larger documents).

The standard format is a Long Document (LD): a single string.
The second format is a Knowledge Graph of the LD: a single string, but with each relation in the graph entered as a sequence "<s> rel1 </s><s> rel2 </s> ... <s> reln </s>"
The target is a summary: a single string.
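
For reference, this is roughly how I expect the KG string to be encoded (the relation strings below are placeholders). As far as I can tell, the literal <s>/</s> markers in the text are matched as the special tokens themselves, on top of the pair the tokenizer adds when add_special_tokens=True, but it is worth printing this in your own environment:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/bart-base")

# Placeholder relations; in my setup each "rel" is a flattened (subject, predicate, object).
kg = "<s> subj1 pred1 obj1 </s><s> subj2 pred2 obj2 </s>"

enc = tok(kg, add_special_tokens=True)
print(tok.convert_ids_to_tokens(enc.input_ids))
# The embedded <s>/</s> markers appear as bos/eos ids themselves,
# in addition to the outer <s> ... </s> pair added by the tokenizer.
```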

The odd thing is that in my initial conda environment, following the version constraints suggested by the unlimiformer authors, I ended up with transformers 4.34.1. That environment somehow got corrupted, so I built a new one; since the constraints seemed unsatisfiable, I dropped them and ended up with transformers 4.35.2.

The difference was stark.
With 4.34.1, LDs result in summaries of length 70-130, whereas KGs result in summaries of length 600-1000, with lots of lists. Moreover, although the KG inputs are about one-fifth the length, their summaries took much longer to generate.
With 4.35.2, both formats produce summaries of very similar length, and training is twice as fast (12 hours to reach 21k steps).

Also, I am not sure what role the max_length=128 and generation_max_length=1024 parameters play here, as the results seem to violate the max_length constraint.
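
My working assumption (not yet verified against the unlimiformer training script, so treat the parameter placement below as a guess) is that the two settings govern different things: max_length=128 truncates the encoded inputs/targets, while generation_max_length=1024 is what Seq2SeqTrainer passes to generate() during evaluation, which would explain why summaries of 600-1000 tokens do not violate the 128 limit:

```python
from transformers import AutoTokenizer, Seq2SeqTrainingArguments

tok = AutoTokenizer.from_pretrained("facebook/bart-base")

# max_length here only truncates the encoded text; it does not bound generation.
batch = tok(["a long document ..."], max_length=128, truncation=True)

# generation_max_length bounds the sequences produced by generate() at eval/predict time.
args = Seq2SeqTrainingArguments(
    output_dir="out",
    predict_with_generate=True,
    generation_max_length=1024,
)
```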

If you feel this is unrelated, I’m happy to open a new post.

I found that the add_special_tokens=False setting seems to be driving the above behaviour, and that this depends on the conda environment (for a fixed transformers version 4.34.1).
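
A quick diagnostic I have been using to see which behaviour a given environment actually gives (print the installed version plus the ids under both settings, then compare across the two environments; the sample string is a placeholder):

```python
import transformers
from transformers import AutoTokenizer

print(transformers.__version__)

tok = AutoTokenizer.from_pretrained("facebook/bart-base")
sample = "<s> subj1 pred1 obj1 </s>"  # placeholder relation string

for add_special in (True, False):
    ids = tok(sample, add_special_tokens=add_special).input_ids
    # With add_special_tokens=False the outer <s>/</s> pair is not added,
    # so the sequence starts and ends with whatever markers are in the text itself.
    print(add_special, tok.convert_ids_to_tokens(ids))
```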

Hey, has anyone got any answers to the above? Thanks!