Hi everyone.
I was going through the *Fine-tuning a masked language model* section of the course, and there's one thing I can't understand. I'm talking about the following piece of code:
concatenated_examples = {
k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}
total_length = len(concatenated_examples["input_ids"])
print(f"'>>> Concatenated reviews length: {total_length}'")
From what I know, BERT expects exactly one [CLS] token at the beginning of each sequence. So if we concatenate all these texts and then split them into fixed-size chunks, a single chunk can end up containing multiple [CLS] and [SEP] tokens (which is not BERT-like), e.g.
[CLS] ... [SEP] [CLS] ... [SEP] ...
in one sequence. Why does this still work? Is there a paper that describes this behavior, or maybe some other source?
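To make the concern concrete, here is a toy sketch of what I mean (using token strings instead of real ids, and a made-up tiny chunk size; the course uses a `chunk_size` of 128):

```python
# Toy tokenized reviews, each with its own [CLS]/[SEP] pair.
tokenized_samples = {
    "input_ids": [
        ["[CLS]", "great", "movie", "[SEP]"],
        ["[CLS]", "boring", "plot", "but", "nice", "score", "[SEP]"],
        ["[CLS]", "ok", "[SEP]"],
    ]
}

# Same concatenation trick as in the course snippet.
concatenated_examples = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}

chunk_size = 5  # toy value for illustration
chunks = [
    concatenated_examples["input_ids"][i : i + chunk_size]
    for i in range(0, len(concatenated_examples["input_ids"]), chunk_size)
]
for chunk in chunks:
    print(chunk)
```

The first chunk comes out as `['[CLS]', 'great', 'movie', '[SEP]', '[CLS]']`, so the special tokens end up scattered in the middle of chunks rather than only at the boundaries.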
Thank you.