What is the magic behind BartForConditionalGeneration?

For certain reasons, I want to modify the linear layer inside BartForConditionalGeneration. Therefore, I use a BartModel with a Linear layer on top, just like BartForConditionalGeneration does internally. However, performance drops sharply when using BartModel with Linear. It's so strange :sob: :cry:

For the same training and evaluation data:
BartForConditionalGeneration
{'Bleu_1': 0.3756316307557612, 'Bleu_2': 0.2187763449001214, 'Bleu_3': 0.14257622050968358, 'Bleu_4': 0.09772224033332834, 'ROUGE_L': 0.31379157899331667, 'CIDEr': 0.2487453519966872}
BartModel with Linear
{'Bleu_1': 0.28135212418299216, 'Bleu_2': 0.039374791862140796, 'Bleu_3': 5.7869382968790495e-08, 'Bleu_4': 9.583990840791874e-11, 'ROUGE_L': 0.13023605134624447, 'CIDEr': 0.012828799693149772}

Here is my code
BartForConditionalGeneration
BartModel with Linear

Some trials and notes for your reference:

  • use set_output_embeddings to replace the linear layer - performance drop
  • tie the linear weight to the BartModel.shared weight - performance drop
  • re-init the linear weight with config std and 0 mean - performance drop
  • clone the BartModel.shared weight to the linear weight - performance drop
  • add or remove the bias - performance drop
  • extend BartForConditionalGeneration and rename the lm_head module - performance drop
  • use a different seq2seq model (t5) - performance drop
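For context, the set_output_embeddings route from the first bullet looks roughly like this. This is a minimal sketch on a tiny, randomly initialized BartConfig (a stand-in I made up so it runs without downloading a checkpoint); the method names mirror the real transformers API:

```python
import torch.nn as nn
from transformers import BartConfig, BartForConditionalGeneration

# Tiny random config as a stand-in for facebook/bart-base
config = BartConfig(
    vocab_size=50, d_model=16,
    encoder_layers=1, decoder_layers=1,
    encoder_attention_heads=2, decoder_attention_heads=2,
    encoder_ffn_dim=32, decoder_ffn_dim=32,
)
model = BartForConditionalGeneration(config)

# Swap in a custom output projection
new_head = nn.Linear(config.d_model, config.vocab_size, bias=False)
model.set_output_embeddings(new_head)
assert model.get_output_embeddings() is new_head
```

Note that with config.tie_word_embeddings left at its default, a later call to tie_weights() would re-tie this head to the shared embedding.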

When copying the shared weight or tying shared.weight, the result is a little better, but still far from BartForConditionalGeneration:

Copy weight:

import copy
lm_head = nn.Linear(pretrained.config.hidden_size, len(tokenizer), bias=False).to(device)
lm_head.weight = copy.copy(pretrained.shared.weight)

Result:
{'Bleu_1': 0.3009317871068947, 'Bleu_2': 0.15865498886231086, 'Bleu_3': 0.09005394179103642, 'Bleu_4': 0.05191279861663496, 'ROUGE_L': 0.22372818945128858, 'CIDEr': 0.15579250859745616}

Tie weight:

lm_head = nn.Linear(pretrained.config.hidden_size, len(tokenizer), bias=False).to(device)
lm_head.weight = pretrained.shared.weight

Result:
{'Bleu_1': 0.30315688210424285, 'Bleu_2': 0.1590543852533103, 'Bleu_3': 0.08880157836836094, 'Bleu_4': 0.04979010468389569, 'ROUGE_L': 0.22960729767442484, 'CIDEr': 0.1570861241454517}
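The difference between copying and tying comes down to Parameter identity: tying assigns the same Parameter object to both modules, so an update to one is visible through the other, while copying creates an independent tensor. A minimal sketch with toy sizes (no BART involved):

```python
import torch
import torch.nn as nn

shared = nn.Embedding(10, 4)            # stand-in for BartModel.shared
lm_head = nn.Linear(4, 10, bias=False)

lm_head.weight = shared.weight          # tying: same Parameter object
assert lm_head.weight is shared.weight

with torch.no_grad():
    shared.weight[0, 0] = 123.0         # an update to the embedding...
assert lm_head.weight[0, 0].item() == 123.0  # ...shows through the head
```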

In BartForConditionalGeneration a fixed bias (final_logits_bias) is added to the logits as well.


I have tried printing final_logits_bias during training and evaluation; it remains zero all the time. So it should be fine to ignore it?
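Right — in the released checkpoints that buffer is all zeros, so adding it is a no-op. A quick sanity check of the arithmetic, with a zero tensor standing in for the real registered buffer:

```python
import torch

vocab_size = 8
logits = torch.randn(2, 3, vocab_size)          # (batch, seq, vocab)
final_logits_bias = torch.zeros(1, vocab_size)  # mimics the registered buffer
biased = logits + final_logits_bias             # what BartForConditionalGeneration adds
assert torch.equal(biased, logits)              # an all-zero bias changes nothing
```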

When I manually decode the result from the custom linear layer, it's different from the model.generate result.

From BartModel with Linear

demo = """Looking good , feeling good Born to a model mom and a suit maker dad , fashion was actually in my blood . I always had a strong desire to dress in a certain way and to stand out from the crowd . I made my own toys when I was a young child and sewed my first skirt at just 10 years old . A friend 's mother took one look at my skirt and told me that I should be a patternmaker . In high school I started making my own clothes , mostly changing other things because I never liked anything how it was when I bought it . During the last two years of school , I worked part - time for a small business that made hand - painted silk clothing and bags . The owner became the teacher who got me into design in the first place . Another useful bit of work experience then came when I worked at a showroom during fashion week and found it very exciting . From there I worked at a top clothing store while I got my business started . For my business I started out with the idea that everything I did would be hand - made and one - of - a - kind , specially made for one individual who hopefully had the same tastes as me . Every morning I jumped out of bed , went to my studio and worked on my projects . This just showed how enthusiastic I felt about my work . And at night I even dreamed of new designs ! Fashion design is _ art . What I mean is that it 's something close to you and something you can touch and feel , and actually interact with . My advice to any young person who wants to be a fashion designer is to get the basic skills early on , such as sewing and pattern - making . Even if you end up specializing , it 's really important to understand all aspects of design in order to make high - quality clothes . Also , if you dream of having your own clothing line , the best thing to do is start wearing your clothes . You have to try and do this because that 's the way you 're going to develop something that 's all yours and unlike anyone else 's . 
I passionately believe that the right clothing can make people feel better and give them more confidence . </s> When the author was in high school , she </s> began to make clothes on her own"""
pretrained.eval()
lm_head.eval()
input_ids = tokenizer.encode(demo, return_tensors='pt', add_special_tokens=False).to(device)
# Greedy "decode" of a single forward pass: argmax over the projected hidden states
tokenizer.decode(torch.argmax(lm_head(pretrained(input_ids).last_hidden_state), -1)[0], skip_special_tokens=True)

Result:
made her to to make

However,

from transformers import AutoTokenizer, BartForConditionalGeneration
hf_model = BartForConditionalGeneration.from_pretrained('facebook/bart-base').to('cuda')

hf_model.lm_head = lm_head
hf_model.model = pretrained

result = hf_model.generate(input_ids,do_sample=False,top_k=1,num_beams=1)
print(result)
tokenizer.decode(result[0],skip_special_tokens=True)

Result:
made her to make clothes

Please note that the behavior of shift_tokens_right in BartModel and BartForConditionalGeneration is different.

You should consider providing decoder_input_ids manually; then you will get the correct results.
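To make the difference concrete: when given labels, BartForConditionalGeneration builds decoder_input_ids by shifting the labels right, while a bare BartModel with no explicit decoder_input_ids falls back to shifting the encoder input_ids instead. A sketch of the shift itself, mirroring the logic of transformers' shift_tokens_right (the token ids are made up, with 2 as decoder_start_token_id and 1 as pad):

```python
import torch

def shift_tokens_right(input_ids, pad_token_id, decoder_start_token_id):
    # Same logic as transformers.models.bart.modeling_bart.shift_tokens_right
    shifted = input_ids.new_zeros(input_ids.shape)
    shifted[:, 1:] = input_ids[:, :-1].clone()
    shifted[:, 0] = decoder_start_token_id
    shifted.masked_fill_(shifted == -100, pad_token_id)  # -100 marks ignored label positions
    return shifted

labels = torch.tensor([[10, 11, 12, 2]])  # hypothetical target ids, 2 = </s>
decoder_input_ids = shift_tokens_right(labels, pad_token_id=1, decoder_start_token_id=2)
print(decoder_input_ids.tolist())  # [[2, 10, 11, 12]]
```

Passing these explicitly, e.g. pretrained(input_ids, decoder_input_ids=decoder_input_ids), reproduces what BartForConditionalGeneration does internally when labels are supplied.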


I believe this is the reason. Thank you for pointing it out. :smiling_face_with_three_hearts: