I have a problem where the BART tokenizer inserts two separator tokens (</s></s>) between the two segments when encoding a sentence pair.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
x = tokenizer.encode("What do you think?", "Nothing.", return_tensors="pt")
It outputs:
tensor([[ 0, 2264, 109, 47, 206, 116, 2, 2, 19847, 4, 2]])
Decoding it back:
tokenizer.batch_decode(x)
['<s>What do you think?</s></s>Nothing.</s>']
Can I do anything about it?