Hi folks,
Problem: fine-tuned model adopts peculiar behaviour with beam search.
Specifically, beam search outputs include two bos tokens and drop the first word token. I have double checked my data feed and the inputs look fine: a single bos token, with the first word token present. Stranger still, there are no such problems with nucleus sampling, which suggests the issue is particular to beam search.
For example, here are two outputs from the same fine-tuned model:
beam search output: <s><s>rek likes to play football. </s>
nucleus sampling output: <s> Derek likes to go to the football pitch. </s>
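For reference, the two decoding calls being compared look roughly like this (a sketch, not my exact code; model is the fine-tuned BART model, input_ids the tokenized source, and the beam size is arbitrary; the real kwargs are in the test_step further down):

beam_out = model.generate(input_ids, num_beams=4, early_stopping=True)
sampled_out = model.generate(input_ids, do_sample=True, top_k=50, top_p=0.95)
print(tokenizer.decode(beam_out[0]))     # <s><s>rek likes to play football. </s>
print(tokenizer.decode(sampled_out[0]))  # <s> Derek likes to go to the football pitch. </s>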
Any clues?
Hi, how did you prepare labels and decoder_input_ids?
As the model is generating the bos token, my guess is you didn't shift the labels. For BART we need to prepare labels and decoder_input_ids like this:
target_ids = tokenizer("...", return_tensors="pt")["input_ids"]
decoder_input_ids = target_ids[:, :-1].contiguous()
labels = target_ids[:, 1:].clone()
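As a concrete sketch of what the shift produces (using facebook/bart-base purely for illustration):

from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
target_ids = tokenizer("Derek likes football.", return_tensors="pt")["input_ids"]
# target_ids:          <s> Derek likes football. </s>
decoder_input_ids = target_ids[:, :-1].contiguous()  # <s> Derek likes football.   (drops the final </s>)
labels = target_ids[:, 1:].clone()                   # Derek likes football. </s>  (drops the leading <s>)

The decoder input keeps exactly one <s>, and the labels start at the first word token.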
Hi @valhalla, I double checked but there might be a hole. Here is my training_step function (I use Lightning):
def training_step(self, batch, batch_id):
    """See Lightning docs. Need to make sure that the first token is not ignored."""
    decoder_inputs = batch["target_ids"][:, :-1].contiguous()
    decoder_labels = batch["target_ids"][:, 1:].clone()
    decoder_labels[batch["target_ids"][:, 1:] == self.tokenizer.pad_token_id] = -100
    loss = self(source_ids=batch["source_ids"],
                padding_mask=batch["padding_mask"],
                decoder_inputs=decoder_inputs,
                decoder_labels=decoder_labels)[0]
    return {"loss": loss}
Here's my dataset __getitem__ for some extra info:
def __getitem__(self, index) -> dict:
    """Generates ONE sample of data; DataLoader takes care of batches. We return everything here to sacrifice
    speed for ease of coding / reading. Can make faster in time but not needed as of yet.
    :param index: ID of single file we want to load data from
    :return item: dictionary containing all possible permutations of data for all possible objectives, such that X
        is the dialogue, y is the summary, and s is the sep_token " TL;DR: ".
    """
    SAMsumID = self.ids[index - 1]
    # load source sample string
    # space added to beginning for tokenization purposes, i.e. tok("name") != tok(" name")
    source_file_path = f"{self.root}/source/source_{SAMsumID}.txt"
    with open(source_file_path, "r") as source_file:
        dialogue = source_file.readline()
    if "t5" in self.arch:
        dialogue = "summarize: " + dialogue
    # load target sample string
    # space added to beginning for tokenization purposes, i.e. tok("name") != tok(" name")
    target_file_path = f"{self.root}/target/target_{SAMsumID}.txt"
    with open(target_file_path, "r") as target_file:
        summary = target_file.readline()
    # we return a dictionary for each __getitem__() call that is collated by the DataLoader
    item = {}
    input_tokens = self.tokenizer(dialogue, padding="max_length", truncation=True, add_special_tokens=True)
    target_tokens = self.tokenizer(summary, padding="max_length", truncation=True, add_special_tokens=True)
    # return necessary batch tensors
    item["SAMsumID"] = SAMsumID
    item["source_ids"] = torch.tensor(input_tokens["input_ids"])
    item["padding_mask"] = torch.tensor(input_tokens["attention_mask"])
    item["target_ids"] = torch.tensor(target_tokens["input_ids"])
    return item
Can you show an example where you call generate twice with the same input_ids and model, but different kwargs, and get weird beam search output? I suspect the issue is in the generate kwargs/decoder_start_token_id if the same model gives different outputs.
Both models have the same, default decoder_start_token_id, and when I manually override this to any arbitrary token <a>, the beam output looks something like <a><s>rek likes to play football. </s>.
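A sketch of that override (assuming the arbitrary token <a> has already been added to the tokenizer; generate accepts decoder_start_token_id directly):

arbitrary_id = tokenizer.convert_tokens_to_ids("<a>")  # assumes "<a>" is in the vocab
beam_out = model.generate(batch["source_ids"],
                          num_beams=4,
                          decoder_start_token_id=arbitrary_id)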
EDIT: the code below decodes with skip_special_tokens=True, but this flag was false for the examples above.
def test_step(self, batch, batch_id):
    """Generate a summary and write it to a file while calculating rouge"""
    # We store the file id to help with consistent results across several GPUs
    SAMsumIDs = batch["SAMsumID"]
    # generate y_preds given seed_ids, placing kwargs in dict is handy for debugging
    if self.sampling == "beam":
        generate_kwargs = {"input_ids": batch["source_ids"],
                           "early_stopping": self.generation_early_stopping,
                           "no_repeat_ngram_size": self.no_repeat_ngrams,
                           "max_length": self.tokenizer.max_len,
                           "top_k": 50,
                           "num_beams": self.beam_size}
    elif self.sampling == "nucleus":
        generate_kwargs = {"input_ids": batch["source_ids"],
                           "max_length": self.tokenizer.max_len,
                           "top_k": 50,
                           "do_sample": True,
                           "top_p": 0.95}
    y_preds = self.model.generate(**generate_kwargs)
    # save generated hypotheses to hypo dest with the same name as the original file, but with a 0 prepended
    for file_id, hypo_token_ids in zip(SAMsumIDs, y_preds):
        prefix = "source_0" if self.anti else "hypo_"
        filename = f"{self.hypo_dest}/{prefix}{file_id}.txt"
        with open(filename, "w") as handle:
            hypo = self.tokenizer.decode(hypo_token_ids, skip_special_tokens=True)
            handle.write(f"{hypo}\n")
    return {"hypo_dialogues": None}
Another update: everything runs smoothly for T5 with all other code unchanged.
@sshleifer, why do we need to manually set the bos_token_id here? I think the generate function already handles that here: https://github.com/huggingface/transformers/blob/master/src/transformers/generation_utils.py#L402
It’s copied from fairseq, but the basic idea is that you want to know the real probability of BOS, not just force it with probability 1. If I recall (open to being disproved), removing it breaks
RUN_SLOW=1 pytest tests/test_modeling_bart.py
I look forward to taking a closer look at this breakage… eventually. Thank you both for the help!
@chrisdoyleIE, so the reason for two bos tokens in beam search is that the generate function sets the decoder_start_token_id (if that is not defined, then bos_token_id, which is <s> for BART) as the prefix token, and adjust_logits_during_generation then forces the generation of bos_token (<s>) again when the current length is one.
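Roughly, the forcing works by masking every logit except the forced token, so at step one the beam can only emit <s> even though the decoder already started from <s>. A minimal sketch of the idea (not the library's exact implementation):

import torch

def force_token(logits: torch.Tensor, token_id: int) -> torch.Tensor:
    # Keep only token_id reachable by pushing every other logit to -inf.
    mask = torch.full_like(logits, float("-inf"))
    mask[:, token_id] = 0
    return logits + mask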
So I also faced the same issue with my BART model fine-tuned for question generation when using beam search. This is what it produced for one of the inputs:
['<s><s> created Python in 1991?']
When I just returned the logits as-is from adjust_logits_during_generation instead of forcing another bos_token, it generated this:
['<s>Who created Python?']
which is what I expected.
But as @sshleifer said, when I modified adjust_logits_during_generation as described above, it breaks
RUN_SLOW=1 pytest tests/test_modeling_bart.py
because the generated summaries no longer match the fairseq ones.
So when I looked at the configs of bart-large-cnn and bart-large-xsum, they use eos_token (</s>) as the decoder_start_token_id, and the tests only pass when the first two tokens are </s><s> (</s> is set as the first token by .generate and <s> is forced by adjust_logits_during_generation).
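(This can be checked straight from the hub configs; a quick sketch, with the expected ids noted from memory and possibly version-dependent:)

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("facebook/bart-large-cnn")
print(cfg.decoder_start_token_id, cfg.bos_token_id, cfg.eos_token_id)  # expect 2, 0, 2, i.e. </s>, <s>, </s>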
So I suspected that the weird beam search output for my model was because of using <s> as the decoder_start_token_id, but when I tried using </s> as the start token like the summarization models, it again gave weird results:
['</s><s> created Python in 1991?']
So to me it seems that adjust_logits_during_generation and </s> as the decoder_start_token only work for the pre-trained summarization models and not for other fine-tuned bart-large or bart-base models.
@chrisdoyleIE, can you also confirm this with your model?
- first, do beam search using eos_token_id as the decoder_start_token_id (see the sketch after the snippet below)
- second, use the bos_token as the start token (as usual) and return the logits as-is from adjust_logits_during_generation instead of forcing <s>. You can do it using:
def adjust_logits(logits, **kwargs):
    return logits

model.adjust_logits_during_generation = adjust_logits
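And for the first check, beam search can be started from eos without touching the config (again a sketch; model and input_ids as above):

beam_out = model.generate(input_ids,
                          num_beams=4,
                          decoder_start_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(beam_out[0], skip_special_tokens=False))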
@sshleifer, if </s> is the decoder_start_token, then why don't we set it when fine-tuning the model? Also, in fairseq they seem to use the bos token <s> as the first prefix (see here).
Sorry for this long post, but there was no other way to explain it.
@valhalla this is a stellar investigation, thank you very much! I’ll need about 2 weeks to submit my thesis then look forward to taking a look.

@sshleifer, @chrisdoyleIE
After modifying adjust_logits_during_generation, the BLEU-4 score for my model went from 13.09 to 19.14. So this indeed is an issue.
Can you PR a fix, or update the docs with the suggested behavior? I am a little busy at the moment.
In the config of your finetuned model, what was decoder_start_token_id?
In your training code, what did a batch “look like”?
decoder_start_token_id was not defined in the config, so generate used bos (its fallback when it's not defined). And the batches were simply prepared like this:
src_text = "Python was created by <hl> Guido van Rossum <hl> in 1991."
target _text = "Who created python ?"
input_ids = tokenizer(src_text, return_tensors="pt")["input_ids"]
target_ids = tokenizer(target , return_tensors="pt")["input_ids"]
decoder_input_ids = target_ids[:, :-1].contiguous()
labels = target_ids[:, 1:].clone()
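(For completeness, these would then be passed to a standard BartForConditionalGeneration roughly like this; a sketch, not the actual training loop:)

outputs = model(input_ids=input_ids,
                decoder_input_ids=decoder_input_ids,
                labels=labels)
loss = outputs[0]  # cross-entropy over the shifted labels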