BART_LM: Odd Beam Search Output

Hi folks,

Problem: my fine-tuned model behaves oddly under beam search.

Specifically, beam search outputs contain two bos tokens and are missing the first word token. I have double-checked my data feed and the inputs look fine: one bos, and the first word token is present. Stranger still, nucleus sampling has no such problem, which suggests the issue is specific to beam search.

For example, here are two outputs from the same fine-tuned model:

beam search output: <s><s>rek likes to play football. </s>
nucleus sampling output: <s> Derek likes to go to the football pitch. </s>

Any clues?

Hi, how did you prepare labels and decoder_input_ids?

As the model is generating a bos token, my guess is that you didn’t shift the labels. For BART we need to prepare labels and decoder_input_ids like this:

target_ids = tokenizer("...", return_tensors="pt")["input_ids"]
decoder_input_ids = target_ids[:, :-1].contiguous()
labels = target_ids[:, 1:].clone()
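To make the shift concrete, here is a toy sketch (ids 0 and 2 match BART's default bos/eos; 100 and 101 are stand-ins for word tokens). The decoder is fed `<s>` and asked to predict the first word token, so `<s>` never appears as a label:

```python
import torch

# toy target sequence: <s> tok_A tok_B </s>
target_ids = torch.tensor([[0, 100, 101, 2]])

# decoder input drops the last token; labels drop the first token
decoder_input_ids = target_ids[:, :-1].contiguous()  # tensor([[0, 100, 101]])
labels = target_ids[:, 1:].clone()                   # tensor([[100, 101, 2]])
```

At each position the decoder sees the token on the left and is trained to predict the token one step ahead, so bos is input-only.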

Hi @valhalla, I double-checked, but there might be a hole. Here is my training_step function (I use Lightning):

    def training_step(self, batch, batch_id):
        """See Lightning docs. Need to make sure that the first token is not ignored."""

        decoder_inputs = batch["target_ids"][:, :-1].contiguous()
        decoder_labels = batch["target_ids"][:, 1:].clone()
        decoder_labels[batch["target_ids"][:, 1:] == self.tokenizer.pad_token_id] = -100

        loss = self(source_ids=batch["source_ids"],
                    padding_mask=batch["padding_mask"],
                    decoder_inputs=decoder_inputs,
                    decoder_labels=decoder_labels)

        return {"loss": loss}
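The -100 masking in the step above can be sketched in isolation (ids are illustrative; 1 is BART's default pad id). CrossEntropyLoss ignores positions labelled -100, so padding does not contribute to the loss:

```python
import torch

pad_id = 1  # BART's default pad token id
target_ids = torch.tensor([[0, 100, 101, 2, 1, 1]])  # padded toy target

labels = target_ids[:, 1:].clone()
labels[labels == pad_id] = -100  # CrossEntropyLoss skips -100 positions
# labels is now [[100, 101, 2, -100, -100]]
```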

Pinging @sshleifer


Here’s my dataset __getitem__ for some extra info:

    def __getitem__(self, index) -> dict:
        """Generates ONE sample of data; the DataLoader takes care of batching. We return everything here, sacrificing
        speed for ease of coding / reading. Can be made faster in time, but not needed as of yet.
        :param index: ID of the single file we want to load data from
        :return item: dictionary containing all possible permutations of data for all possible objectives, such that X
        is the dialogue, y is the summary, and s is the sep_token " TL;DR: ".
        """

        SAMsumID = self.ids[index - 1]

        # load source sample string
        # space added to the beginning for tokenization purposes, i.e. tok("name") != tok(" name")
        source_file_path = f"{self.root}/source/source_{SAMsumID}.txt"
        with open(source_file_path, "r") as source_file:
            dialogue = source_file.readline()
            if "t5" in self.arch: dialogue = "summarize: " + dialogue

        # load target sample string
        # space added to the beginning for tokenization purposes, i.e. tok("name") != tok(" name")
        target_file_path = f"{self.root}/target/target_{SAMsumID}.txt"
        with open(target_file_path, "r") as target_file:
            summary = target_file.readline()

        # we will return a dictionary for each getitem() call that is collated by dataloader
        item = {}
        input_tokens = self.tokenizer(dialogue, padding="max_length", truncation=True, add_special_tokens=True)
        target_tokens = self.tokenizer(summary, padding="max_length", truncation=True, add_special_tokens=True)

        # return necessary batch tensors
        item["SAMsumID"] = SAMsumID
        item["source_ids"] = torch.tensor(input_tokens["input_ids"])
        item["padding_mask"] = torch.tensor(input_tokens["attention_mask"])
        item["target_ids"] = torch.tensor(target_tokens["input_ids"])

        return item

Can you show an example where you call generate twice with the same input_ids and model, but different kwargs, and get weird beam search output?

I suspect the issue is in the generate kwargs / decoder_start_token_id, given that the same model produces different outputs.

Both models have the same, default decoder_start_token_id, and when I manually override it to an arbitrary token <a>, the beam output looks something like <a><s>rek likes to play football. </s>.

EDIT: The code below decodes with skip_special_tokens=True, but this flag is False for the examples above.

    def test_step(self, batch, batch_id):
        """Generate a summary and write it to a file while calculating rouge"""

        # We store the file id to help with consistent results across several GPUs
        SAMsumIDs = batch["SAMsumID"]

        # generate y_preds given seed_ids, placing kwargs in dict is handy for debugging
        if self.sampling == "beam":
            generate_kwargs = {"input_ids": batch["source_ids"],
                               "early_stopping": self.generation_early_stopping,
                               "no_repeat_ngram_size": self.no_repeat_ngrams,
                               "max_length": self.tokenizer.max_len,
                               "top_k": 50,
                               "num_beams": self.beam_size}

        elif self.sampling == "nucleus":
            generate_kwargs = {"input_ids": batch["source_ids"],
                               "max_length": self.tokenizer.max_len,
                               "top_k": 50,
                               "do_sample": True,
                               "top_p": 0.95}

        y_preds = self.model.generate(**generate_kwargs)
        # save generated hypotheses to hypo dest with the same name as the original file, but with a 0 prepended
        for file_id, hypo_token_ids in zip(SAMsumIDs, y_preds):
            prefix = "source_0" if self.anti else "hypo_"
            filename = f"{self.hypo_dest}/{prefix}{file_id}.txt"
            with open(filename, "w") as handle:
                hypo = self.tokenizer.decode(hypo_token_ids, skip_special_tokens=True)
                handle.write(hypo)

        return {"hypo_dialogues": None}

Another update: everything runs smoothly for T5, with all other code unchanged.

Maybe this line is causing an issue?


@sshleifer, why do we need to manually set the bos_token_id here? I think the generate function handles that here already.

It’s copied from fairseq, but the basic idea is that you want to know the real probability of BOS, not just force it with probability 1. If I recall (open to being disproved), removing it breaks:

RUN_SLOW=1 pytest tests/

I look forward to taking a closer look at this breakage… eventually. Thank you both for the help!

So the reason for the two bos tokens in beam search is that the generate function sets the decoder_start_token_id (or, if that is not defined, the bos_token_id, which is <s> for BART) as the prefix token,
and adjust_logits_during_generation then forces the generation of the bos_token (<s>) when the current length is one.
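As a minimal standalone sketch of what "forcing" means here (this mirrors the idea, not the exact transformers source): every logit except the forced token's is masked to -inf, so the search has no choice at that step:

```python
import torch

def force_token(logits: torch.Tensor, token_id: int) -> torch.Tensor:
    """Mask every logit except token_id to -inf, so token_id is always chosen."""
    forced = torch.full_like(logits, float("-inf"))
    forced[:, token_id] = logits[:, token_id]
    return forced

logits = torch.randn(1, 50265)  # fake scores over a BART-sized vocab
bos_id = 0                      # <s> in BART's vocab
forced = force_token(logits, bos_id)
```

After this adjustment, argmax (and every beam) must pick bos, regardless of what the model actually assigned it.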
I also faced the same issue with my BART model fine-tuned for question generation when using beam search. This is what it produced for one of the inputs:
['<s><s> created Python in 1991?']

When I just returned the logits as-is from adjust_logits_during_generation instead of forcing another bos_token, it generated this:
['<s>Who created Python?']
which is what I expected.

But as @sshleifer said, when I modified adjust_logits_during_generation as described above, it breaks
RUN_SLOW=1 pytest tests/ because the generated summaries no longer match the fairseq ones.

So I looked at the configs of bart-large-cnn and bart-large-xsum: they use the eos token (</s>) as the decoder_start_token_id, and the tests only pass when the first two tokens are </s><s> (</s> is set as the first token by .generate, and <s> is set by adjust_logits_during_generation).

So I suspected that the weird beam search output for my model comes from using <s> as the decoder_start_token_id, but when I tried using </s> as the start token, like the summarization models do, it again gave weird results:
['</s><s> created Python in 1991?']

So it seems to me that adjust_logits_during_generation with </s> as the decoder_start_token only works for the pre-trained summarization models, and not for other fine-tuned bart-large or bart-base models.

@chrisdoyleIE, can you also confirm this with your model:

  1. First do beam search using eos_token_id as the decoder_start_token_id.
  2. Use the bos token as the start token (as usual) and return the logits as-is from adjust_logits_during_generation instead of forcing <s>. You can do it like this:
def adjust_logits(logits, **kwargs):
    return logits
model.adjust_logits_during_generation = adjust_logits

@sshleifer, if </s> is the decoder_start_token, then why don’t we set it when fine-tuning the model? Also, in fairseq they seem to use the bos token <s> as the first prefix. Here

Sorry for the long post, but there was no other way to explain it.


@valhalla this is a stellar investigation, thank you very much! I’ll need about 2 weeks to submit my thesis then look forward to taking a look.


@sshleifer, @chrisdoyleIE
After modifying adjust_logits_during_generation, the BLEU-4 score for my model went from 13.09 to 19.14. So this indeed is an issue.


Can you PR a fix or update the docs with the suggested behavior? I am a little busy at the moment.


In the config of your finetuned model, what was decoder_start_token_id?
In your training code, what did a batch “look like”?

decoder_start_token_id was not defined in the config, so generate fell back to bos. And the batches were simply prepared like this:

src_text = "Python was created by <hl> Guido van Rossum <hl> in 1991."
target_text = "Who created python ?"
input_ids = tokenizer(src_text, return_tensors="pt")["input_ids"]
target_ids = tokenizer(target_text, return_tensors="pt")["input_ids"]
decoder_input_ids = target_ids[:, :-1].contiguous()
labels = target_ids[:, 1:].clone()