What is the magic behind BartForConditionalGeneration?

For certain reasons, I want to modify the linear layer inside BartForConditionalGeneration. Therefore, I use a BartModel with a Linear layer on top, just like BartForConditionalGeneration does internally. However, performance drops sharply when using BartModel with Linear. It's so strange :sob: :cry:

For the same training and evaluation data:
BartForConditionalGeneration
{'Bleu_1': 0.3756316307557612, 'Bleu_2': 0.2187763449001214, 'Bleu_3': 0.14257622050968358, 'Bleu_4': 0.09772224033332834, 'ROUGE_L': 0.31379157899331667, 'CIDEr': 0.2487453519966872}
BartModel with Linear
{'Bleu_1': 0.28135212418299216, 'Bleu_2': 0.039374791862140796, 'Bleu_3': 5.7869382968790495e-08, 'Bleu_4': 9.583990840791874e-11, 'ROUGE_L': 0.13023605134624447, 'CIDEr': 0.012828799693149772}

Here is my code
BartForConditionalGeneration
BartModel with Linear

Some trials and notes for your reference:

  • use set_output_embeddings to replace the linear layer - performance drop
  • tie the linear weight to the BartModel.shared weight - performance drop
  • re-init the linear weight with config std and 0 mean - performance drop
  • clone the BartModel.shared weight to the linear weight - performance drop
  • add or remove the bias - performance drop
  • extend BartForConditionalGeneration and rename the lm_head module - performance drop
  • use a different seq2seq model (t5) - performance drop
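For context, the set_output_embeddings route from the first bullet looks roughly like this. This is a minimal sketch on a tiny, randomly initialized BartConfig (a stand-in I made up so it runs without downloading a checkpoint); the method names mirror the real transformers API:

```python
import torch.nn as nn
from transformers import BartConfig, BartForConditionalGeneration

# Tiny random config as a stand-in for facebook/bart-base
config = BartConfig(
    vocab_size=50, d_model=16,
    encoder_layers=1, decoder_layers=1,
    encoder_attention_heads=2, decoder_attention_heads=2,
    encoder_ffn_dim=32, decoder_ffn_dim=32,
)
model = BartForConditionalGeneration(config)

# Swap in a custom output projection
new_head = nn.Linear(config.d_model, config.vocab_size, bias=False)
model.set_output_embeddings(new_head)
assert model.get_output_embeddings() is new_head
```

Note that with config.tie_word_embeddings left at its default, a later call to tie_weights() would re-tie this head to the shared embedding.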

When copying the shared weight or tying shared.weight, the result is a little better, but still far from BartForConditionalGeneration:

Copy weight:

import copy
lm_head = nn.Linear(pretrained.config.hidden_size, len(tokenizer), bias=False).to(device)
lm_head.weight = copy.copy(pretrained.shared.weight)

Result:
{'Bleu_1': 0.3009317871068947, 'Bleu_2': 0.15865498886231086, 'Bleu_3': 0.09005394179103642, 'Bleu_4': 0.05191279861663496, 'ROUGE_L': 0.22372818945128858, 'CIDEr': 0.15579250859745616}

Tie weight:

lm_head = nn.Linear(pretrained.config.hidden_size, len(tokenizer), bias=False).to(device)
lm_head.weight = pretrained.shared.weight

Result:
{'Bleu_1': 0.30315688210424285, 'Bleu_2': 0.1590543852533103, 'Bleu_3': 0.08880157836836094, 'Bleu_4': 0.04979010468389569, 'ROUGE_L': 0.22960729767442484, 'CIDEr': 0.1570861241454517}
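The difference between copying and tying comes down to Parameter identity: tying assigns the same Parameter object to both modules, so an update to one is visible through the other, while copying creates an independent tensor. A minimal sketch with toy sizes (no BART involved):

```python
import torch
import torch.nn as nn

shared = nn.Embedding(10, 4)            # stand-in for BartModel.shared
lm_head = nn.Linear(4, 10, bias=False)

lm_head.weight = shared.weight          # tying: same Parameter object
assert lm_head.weight is shared.weight

with torch.no_grad():
    shared.weight[0, 0] = 123.0         # an update to the embedding...
assert lm_head.weight[0, 0].item() == 123.0  # ...shows through the head
```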

In BartForConditionalGeneration a fixed bias (final_logits_bias) is added to the logits as well.


I have tried printing final_logits_bias during training and evaluation; it remains zero all the time. So it should be fine to ignore it?
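Right — in the released checkpoints that buffer is all zeros, so adding it is a no-op. A quick sanity check of the arithmetic, with a zero tensor standing in for the real registered buffer:

```python
import torch

vocab_size = 8
logits = torch.randn(2, 3, vocab_size)          # (batch, seq, vocab)
final_logits_bias = torch.zeros(1, vocab_size)  # mimics the registered buffer
biased = logits + final_logits_bias             # what BartForConditionalGeneration adds
assert torch.equal(biased, logits)              # an all-zero bias changes nothing
```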

When I manually decode the result from the custom linear layer, it's different from the model.generate result.

From BartModel with Linear

demo = """Looking good , feeling good Born to a model mom and a suit maker dad , fashion was actually in my blood . I always had a strong desire to dress in a certain way and to stand out from the crowd . I made my own toys when I was a young child and sewed my first skirt at just 10 years old . A friend 's mother took one look at my skirt and told me that I should be a patternmaker . In high school I started making my own clothes , mostly changing other things because I never liked anything how it was when I bought it . During the last two years of school , I worked part - time for a small business that made hand - painted silk clothing and bags . The owner became the teacher who got me into design in the first place . Another useful bit of work experience then came when I worked at a showroom during fashion week and found it very exciting . From there I worked at a top clothing store while I got my business started . For my business I started out with the idea that everything I did would be hand - made and one - of - a - kind , specially made for one individual who hopefully had the same tastes as me . Every morning I jumped out of bed , went to my studio and worked on my projects . This just showed how enthusiastic I felt about my work . And at night I even dreamed of new designs ! Fashion design is _ art . What I mean is that it 's something close to you and something you can touch and feel , and actually interact with . My advice to any young person who wants to be a fashion designer is to get the basic skills early on , such as sewing and pattern - making . Even if you end up specializing , it 's really important to understand all aspects of design in order to make high - quality clothes . Also , if you dream of having your own clothing line , the best thing to do is start wearing your clothes . You have to try and do this because that 's the way you 're going to develop something that 's all yours and unlike anyone else 's . 
I passionately believe that the right clothing can make people feel better and give them more confidence . </s> When the author was in high school , she </s> began to make clothes on her own"""
pretrained.eval()
lm_head.eval()
input_ids = tokenizer.encode(demo, return_tensors='pt', add_special_tokens=False).to(device)
# Greedy "decode" of a single forward pass: argmax over the projected hidden states
tokenizer.decode(torch.argmax(lm_head(pretrained(input_ids).last_hidden_state), -1)[0], skip_special_tokens=True)

Result:
made her to to make

However,

from transformers import AutoTokenizer, BartForConditionalGeneration
hf_model = BartForConditionalGeneration.from_pretrained('facebook/bart-base').to('cuda')

hf_model.lm_head = lm_head
hf_model.model = pretrained

result = hf_model.generate(input_ids,do_sample=False,top_k=1,num_beams=1)
print(result)
tokenizer.decode(result[0],skip_special_tokens=True)

Result:
made her to make clothes

Please note that the behavior of shift_tokens_right in BartModel and BartForConditionalGeneration is different.

You should consider providing decoder_input_ids manually; then you will get the correct results.
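To make the difference concrete: when given labels, BartForConditionalGeneration builds decoder_input_ids by shifting the labels right, while a bare BartModel with no explicit decoder_input_ids falls back to shifting the encoder input_ids instead. A sketch of the shift itself, mirroring the logic of transformers' shift_tokens_right (the token ids are made up, with 2 as decoder_start_token_id and 1 as pad):

```python
import torch

def shift_tokens_right(input_ids, pad_token_id, decoder_start_token_id):
    # Same logic as transformers.models.bart.modeling_bart.shift_tokens_right
    shifted = input_ids.new_zeros(input_ids.shape)
    shifted[:, 1:] = input_ids[:, :-1].clone()
    shifted[:, 0] = decoder_start_token_id
    shifted.masked_fill_(shifted == -100, pad_token_id)  # -100 marks ignored label positions
    return shifted

labels = torch.tensor([[10, 11, 12, 2]])  # hypothetical target ids, 2 = </s>
decoder_input_ids = shift_tokens_right(labels, pad_token_id=1, decoder_start_token_id=2)
print(decoder_input_ids.tolist())  # [[2, 10, 11, 12]]
```

Passing these explicitly, e.g. pretrained(input_ids, decoder_input_ids=decoder_input_ids), reproduces what BartForConditionalGeneration does internally when labels are supplied.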


I believe this is the reason. Thank you for pointing it out. :smiling_face_with_three_hearts: