Hello
I am using facebook/BART for a seq2seq task. I followed along with this tutorial, Translation - Hugging Face NLP Course, and found something weird.
When I use BART to predict a sample, there are -100 tokens in the output array:
array([ 2, 0, 8800, 3850, 37589, 1000, 3675, 23054, 7778,
2, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100])
You can see that it already uses 1 as BART's padding token. So where does the -100 come from?
The model used in the tutorial is Marian, and it doesn't predict any -100.
These are my training args:
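For reference, this is roughly how I get those predictions (just a sketch; trainer is my Seq2SeqTrainer and tokenized_datasets is my tokenized data, set up the same way as in the tutorial):

# Sketch of how I get the array above; trainer and tokenized_datasets
# are from my own setup (same pattern as the tutorial).
predict_results = trainer.predict(tokenized_datasets["test"])
preds = predict_results.predictions  # shape (num_samples, sequence_length)
print(preds[0])  # prints the array shown above, including the -100 values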
args = Seq2SeqTrainingArguments(
    save_folder,
    overwrite_output_dir=True,
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-6,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    save_total_limit=3,
    num_train_epochs=20,
    predict_with_generate=True,
    fp16=True,
    report_to="none",
    load_best_model_at_end=True,
    seed=65,
    generation_max_length=128,
    generation_num_beams=10,
)
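For completeness, this is roughly how I plug those args into the trainer (a sketch following the tutorial; model, tokenizer, and tokenized_datasets come from my own preprocessing):

from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainer

# Sketch of the trainer setup (same pattern as the tutorial);
# model, tokenizer, and tokenized_datasets are assumed to exist already.
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)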
When there are -100 values in the predictions, they break this function:
import numpy as np

def postprocess(predictions, labels):
    predictions = predictions.cpu().numpy()
    labels = labels.cpu().numpy()
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]
    return decoded_preds, decoded_labels
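The only workaround I can think of is to replace -100 in the predictions the same way as in the labels, but I am not sure that is the intended fix (just a sketch):

# Sketch of a possible workaround: treat -100 in the predictions
# the same way the labels are treated before decoding.
predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)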
I would also like to know how to control the length of the predicted text. Right now my predictions have shape (m, 79), even though I set generation_max_length=128. Why 79?
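For what it's worth, this is how I assumed the length could be controlled when calling generate directly (a sketch; model and tokenizer are my fine-tuned BART model and its tokenizer):

# Sketch of calling generate directly with an explicit length cap;
# model and tokenizer are assumed to be the fine-tuned BART model/tokenizer.
inputs = tokenizer(["some input text"], return_tensors="pt", truncation=True)
generated = model.generate(
    **inputs,
    max_length=128,  # upper bound on the generated sequence length
    num_beams=10,
)
print(generated.shape)  # in practice the second dimension varies, e.g. 79

Is this the right knob, or should generation_max_length in the training args already take care of it?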