Pretraining BART for conditional generation

I want to use BartForConditionalGeneration with my custom data. The main functionality I’m interested in is sentence infilling (i.e., multiple mask prediction)—as shown in the BartForConditionalGeneration example from the docs. I won’t actually be doing translation or summarization, but all the info I’m finding about pretraining BART seems specific to those two tasks.
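(For context, the kind of usage I mean is the mask-filling snippet from the docs; this is roughly it from memory, and the checkpoint name is just an example:)

from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

# a single <mask> can stand in for a span of several tokens
text = "UN Chief Says There Is No <mask> in Syria"
batch = tokenizer(text, return_tensors="pt")
generated = model.generate(batch["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))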

How do I go about pretraining for the conditional generation task? My dataset is currently just the one I’ve used for pretraining GPT-2 and RoBERTa models—i.e., no “labels”.

Any help appreciated.

I finally managed to get a BART model pretraining, but I have a strange problem. My custom data is not natural language and doesn’t have normal sentence structure, so I don’t explicitly use <s> and </s>. I have a special token that I generally use to delineate segments during inference/generation, but it’s also part of my data representation, so I keep it in my outputs.

Because of this, when building my “noisy” dataset, I formatted my decoder_input by adding a <pad> token at the start of each input and truncating the end (i.e., to mimic the “shifting” of the sentence). My “noising” function looks like:

def add_noise(examples):
    line = examples['text']
    stripped = line.strip()
    labels = stripped.split(' ')                     # original token sequence = the target
    split = split_point(len(labels))                 # split_point is my own helper
    decoder_input = labels[split:] + labels[:split]  # rotate the sequence at the split point
    noised = mask_spans(decoder_input, 0.3, tokenizer.mask_token)  # mask_spans is my own span-masking helper
    examples['input'] = ' '.join(noised)
    examples['decoder_input'] = ' '.join([tokenizer.pad_token] + decoder_input[:-1])  # manual "shift": prepend <pad>, drop last token
    examples['labels'] = ' '.join(labels)
    return examples
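(In case it matters, I apply this over my plain-text data with something like the datasets library’s map; the file name below is just a placeholder.)

from datasets import load_dataset

raw = load_dataset("text", data_files={"train": "train.txt"})  # placeholder file name
noised = raw["train"].map(add_noise)  # adds 'input', 'decoder_input', and 'labels' columns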

So if an original “sentence” in my non-language data was something like:

text: This is not language as you know it.

Then my data field for the decoder would be something like:

decoder_input: <pad> This is not language as you know

(I’m truncating the end because my raw data is pre-formatted so that every line is max_length tokens long.)

When running infilling generation, the output works as expected at the position of the <mask> (which is awesome!), but the first 2 tokens of the sentence are always cut off, with </s><s> inserted at the start instead. That is, if I input:

Do some awesome <mask> for me, please.

I get an output like:

</s><s> awesome sentence infilling for me, please.

I think I understand why the </s><s> is there, since this would be the shifted version of a typical input and I’m keeping special tokens on output (since I use my own special tokens later in processing). But I don’t understand why the 2 tokens (“Do some”) would get cut from the start. I’m just generating with this:

prediction = model.generate(
    token_ids["input_ids"], 
    max_length=target_len, 
    do_sample=True,
    top_k=50, 
    return_dict_in_generate=True, 
    output_scores=True
)

…so nothing unusual.
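(For completeness, the output string above comes from decoding the returned sequences without skipping special tokens, since I need my own specials downstream; roughly:)

decoded = tokenizer.decode(prediction.sequences[0])  # deliberately not skipping special tokens
print(decoded)  # e.g. "</s><s> awesome sentence infilling for me, please.</s>"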

I seem to remember seeing a comment in a forum post that we don’t need to manually shift our decoder_input, since HF will do this automatically. Is that true? And is that perhaps why the first 2 tokens of my input are getting chopped off in the output?
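For reference, I think the piece I half-remember is the shift_tokens_right helper in modeling_bart, which (if I’m reading the source right) the model applies to labels whenever you don’t pass decoder_input_ids yourself. A minimal sketch of what I believe it produces, assuming a reasonably recent transformers version:

from transformers import BartTokenizer, BartConfig
from transformers.models.bart.modeling_bart import shift_tokens_right

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
config = BartConfig.from_pretrained("facebook/bart-base")

labels = tokenizer("Do some awesome sentence infilling for me, please.",
                   return_tensors="pt").input_ids
# prepends decoder_start_token_id (</s> for BART) and shifts everything right by one
decoder_input_ids = shift_tokens_right(labels, config.pad_token_id,
                                       config.decoder_start_token_id)
print(tokenizer.decode(decoder_input_ids[0]))  # starts with "</s><s>Do some ..."

If that’s right, my manually shifted, <pad>-prefixed decoder_input wouldn’t match what the model builds for itself, which might be related to what I’m seeing.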

At the end of the day it isn’t a huge deal, since I (obviously) have these tokens from my input and can just prepend them to the result. But I’d like to have it working as expected, just in case it’s an indication that something else is wrong.

Thanks in advance for any thoughts.