I finally managed to get a Bart model pretraining, but I have a strange problem. My custom data is not natural language and doesn’t have normal sentence structure, so I don’t explicitly use <s> and </s>. I have a special token that I generally use to delineate segments during inference/generation, but it’s also part of my data representation, so I keep it in my outputs. Because of this, when building my “noisy” dataset, I formatted my decoder_input by adding a <pad> token at the start of each input and truncating the end (i.e., to mimic the “shifting” of the sentence). My “noising” function looks like:
def add_noise(examples):
    line = examples['text']
    stripped = line.strip()
    # the original, unrotated tokens become the labels
    labels = stripped.split(' ')
    # rotate the token sequence at a (randomly chosen) split point
    split = split_point(len(labels))
    decoder_input = labels[split:] + labels[:split]
    # span-mask the rotated sequence to build the noisy encoder input (mask_spans is my own helper)
    input = mask_spans(decoder_input, 0.3, tokenizer.mask_token)
    examples['input'] = ' '.join(input)
    # manually "shift right": prepend <pad> and drop the last token
    examples['decoder_input'] = ' '.join([tokenizer.pad_token] + decoder_input[:-1])
    examples['labels'] = ' '.join(labels)
    return examples
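(For context, I apply this per line over a datasets object, roughly like the snippet below; the file name and loading details are just placeholders for how my pre-formatted data happens to be stored.)

from datasets import load_dataset

# 'lines.txt' is a placeholder; the 'text' loader yields one example per line, in a 'text' column
raw_dataset = load_dataset('text', data_files={'train': 'lines.txt'})
noisy_dataset = raw_dataset.map(add_noise)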
So if an original “sentence” in my non-language data was something like:
text: This is not language as you know it.
Then my data field for the decoder would be something like (ignoring the rotation from the noising step, just to keep the example readable):
decoder_input: <pad> This is not language as you know
(I’m truncating the end because my raw data is pre-formatted so that every line is max_length tokens long.)
When running infilling generation, the output works as expected at the position of the <mask> (which is awesome!), but I’m noticing that the start of the sentence always cuts off 2 tokens and inserts </s><s>. That is, if I input:
Do some awesome <mask> for me, please.
I get an output like:
</s><s> awesome sentence infilling for me, please.
I think I understand why the </s><s> is there, since this would be the shifted version of a typical input and I’m keeping special tokens on output (since I use my own special tokens later in processing). But I don’t understand why the 2 tokens (“Do some”) would get cut from the start. I’m just generating with this:
prediction = model.generate(
    token_ids["input_ids"],
    max_length=target_len,
    do_sample=True,
    top_k=50,
    return_dict_in_generate=True,
    output_scores=True
)
…so nothing unusual.
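For completeness, I pull the text back out roughly like this, keeping special tokens since I need my own tokens later in the pipeline (which is why the </s><s> shows up in the decoded string at all):

# prediction.sequences holds the generated ids because return_dict_in_generate=True
decoded = tokenizer.batch_decode(
    prediction.sequences,
    skip_special_tokens=False  # deliberately keep </s>, <s>, and my own special tokens
)
print(decoded[0])  # e.g. "</s><s> awesome sentence infilling for me, please."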
I seem to remember seeing a comment in a forum post that we don’t need to manually shift our decoder_input since HF will do this automatically, but is that true? And is that perhaps why the first 2 tokens of my input are getting chopped off in the generated output?
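In case it helps, my (possibly wrong) understanding of what happens when you pass only labels is sketched below: the model builds decoder_input_ids itself by shifting the labels right and putting config.decoder_start_token_id in position 0, which for BART is </s> rather than <pad> (and generation then forces <s> as the first real token, which would explain the </s><s> prefix). This is just my paraphrase of that shift logic, not a verbatim copy of the library code:

import torch

def shift_right(labels: torch.Tensor, pad_token_id: int, decoder_start_token_id: int) -> torch.Tensor:
    # my paraphrase of the right-shift the model applies to labels when
    # decoder_input_ids aren't provided explicitly
    shifted = labels.new_zeros(labels.shape)
    shifted[:, 1:] = labels[:, :-1].clone()
    shifted[:, 0] = decoder_start_token_id   # for BART this is </s>, not <pad>
    shifted.masked_fill_(shifted == -100, pad_token_id)  # -100 is the ignore index used by the loss
    return shifted

If that’s accurate, then my <pad>-at-the-start convention differs from the </s> start that generate() assumes, though I’m not sure that by itself explains the two missing tokens.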
At the end of the day it isn’t a huge deal, since I (obviously) have these tokens from my input and can just prepend them to the result. But I’d like to have it working as expected, just in case it’s an indication that something else is wrong.
Thanks in advance for any thoughts.