I am following the Trainer example to fine-tune a BERT model on my data for text classification, using the pre-trained (uncased) tokenizer.
In all examples I have found, the input texts are either single sentences or lists of sentences. However, my data is one string per document, comprising multiple sentences. When I inspect the tokenizer output, there are no [SEP] tokens inserted between the sentences.
This is how I tokenize my dataset:
```python
def encode(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length")

train_dataset = train_dataset.map(encode, batched=True)
```
And this is an example result of the tokenization:
```python
tokenizer.decode(train_dataset["input_ids"][0])
# [CLS] this is the first sentence. this is the second sentence. [SEP]
```
Given the special tokens at the beginning and the end, and the lower-cased output, the input appears to have been tokenized as expected. However, I was expecting to see a [SEP] between the sentences, as is the case when the input is passed to the tokenizer as a sentence pair.
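For comparison, this is the behaviour I had in mind. As far as I understand, it is what happens when two segments are passed to the tokenizer as a pair (a minimal sketch; bert-base-uncased is my assumption for the checkpoint):

```python
from transformers import AutoTokenizer

# Assuming an uncased BERT checkpoint; the exact model name is my guess.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Two texts passed as a pair are joined with separator tokens.
encoded = tokenizer("This is the first sentence.", "This is the second sentence.")
print(tokenizer.decode(encoded["input_ids"]))
# -> [CLS] this is the first sentence. [SEP] this is the second sentence. [SEP]
```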
What is the recommended approach? Should I split each input document into sentences and run the tokenizer on them, or does the Transformer model handle the continuous stream of sentences on its own?
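If splitting is the way to go, this is roughly what I have in mind: split each document with a sentence tokenizer and rejoin the sentences with the model's separator token before encoding. A sketch of that idea, assuming nltk's sent_tokenize for the splitting and that inserting tokenizer.sep_token into the raw text is acceptable (both are my assumptions, not something I found documented):

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # model used by sent_tokenize

def encode_split(examples):
    # `tokenizer` and `train_dataset` are the same objects as above.
    # Rejoin the sentences of each document with the tokenizer's
    # separator token, so a [SEP] lands between every pair of sentences.
    texts = [
        f" {tokenizer.sep_token} ".join(sent_tokenize(doc))
        for doc in examples["text"]
    ]
    return tokenizer(texts, truncation=True, padding="max_length")

train_dataset = train_dataset.map(encode_split, batched=True)
```

I am not sure whether this handles token_type_ids correctly for BERT, which is part of why I am asking.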
I have seen related posts discussing this, but it is not clear to me whether their advice applies to the standard pipeline.