I am following the Trainer example to fine-tune a BERT model on my data for text classification, using the pre-trained tokenizer (bert-base-uncased).
In all the examples I have found, the input texts are either single sentences or lists of sentences. However, my data is one string per document, comprising multiple sentences, and when I inspect the tokenizer output, there are no [SEP] tokens in between the sentences. This is an example result of the tokenization:
tokenizer.decode(train_dataset[0]["input_ids"])
[CLS] this is the first sentence. this is the second sentence. [SEP]
Given the special tokens at the beginning and the end, and the lower-cased output, I can see that the input has been tokenized as expected. However, I was expecting to see a [SEP] between each sentence, as is the case when the input comprises a list of sentences.
What is the recommended approach? Should I split the input documents into sentences, and run the tokenizer on each of them? Or does the Transformer model handle the continuous stream of sentences?
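For comparison, this is the behavior I mean when the input comprises separate segments, e.g. a sentence pair; a minimal sketch (the example sentences are made up):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
pair_enc = tokenizer("This is the first sentence.", "This is the second sentence.")
tokenizer.decode(pair_enc.input_ids)  # returns "[CLS] this is the first sentence. [SEP] this is the second sentence. [SEP]"

Here the tokenizer does insert a [SEP] between the two segments.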
I have seen related forum posts on this, but it is not clear to me whether what they describe applies to a standard pipeline.
hey @carschno, how long are your documents (on average) and what kind of performance do you get with the current approach (i.e. tokenizing + truncating the whole document)?
if performance is a problem then, since you’re doing text classification, you could try chunking each document into smaller passages with a sliding window (see this tutorial for details on how that’s done), and then aggregating the [CLS] representations for each window in a manner similar to this paper.
it’s not the most memory-efficient strategy, but if your documents are not super long it might be a viable alternative to simple truncation.
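here’s a minimal sketch of the windowing step, assuming a fast tokenizer (the max_length and stride values are just illustrative, and long_document is a stand-in for one of your scraped pages):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
long_document = " ".join(["this is a sentence."] * 400)  # stand-in for a long web page

# each window holds at most 512 tokens (including [CLS]/[SEP]);
# consecutive windows overlap by 128 tokens
enc = tokenizer(
    long_document,
    max_length=512,
    truncation=True,
    stride=128,
    return_overflowing_tokens=True,
)
len(enc["input_ids"])  # number of windows; each window starts with [CLS] and ends with [SEP]

each window can then be run through the model and the per-window [CLS] outputs pooled (e.g. mean or max) before the classification head, along the lines of the paper above.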
Hi @lewtun,
Thanks for the suggestion, I think the sliding window approach looks very promising indeed.
My documents are body texts from scraped web pages, so the length varies widely, from a few tokens to very long texts.
Current performance of the fine-tuned classifier is around 0.8-0.9 accuracy (2 classes). I’ll have to look deeper into whether this is good enough for my application, and do more data analysis.
Anyway, for clarification: I understand that the tokenizer behavior I described above is expected, and that the BERT model is supposed to handle input texts with multiple sentences in a single string well, right?
Yes, you’re totally right. From the tokenizer’s perspective, it doesn’t matter if the input string is composed of one or more sentences - it will split it into words/subwords according to the underlying tokenization algorithm (WordPiece in BERT’s case). In case you want to see the tokens directly, you can use the tokenizer’s convert_ids_to_tokens function on the input_ids returned by the tokenizer.
they’re similar, but convert_ids_to_tokens lets you see the subwords, e.g.:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
txt = "The Higgs field gives mass to subatomic particles"
txt_enc = tokenizer(txt)
tokenizer.convert_ids_to_tokens(txt_enc.input_ids)  # returns ['[CLS]', 'the', 'hi', '##ggs', 'field', 'gives', 'mass', 'to', 'sub', '##ato', '##mic', 'particles', '[SEP]']
tokenizer.decode(txt_enc.input_ids)  # returns "[CLS] the higgs field gives mass to subatomic particles [SEP]"
the decode method combines the subwords to produce a single string and adds the special tokens, as in your example. I find both methods quite handy for debugging!
From the tokenizer’s perspective, it doesn’t matter if the input string is composed of one or more sentences - it will split it into words/subwords according to the underlying tokenization algorithm (WordPiece in BERT’s case).
Am I correct that if I have data that doesn’t exceed max_seq_length, but each observation consists of 2 or more sentences, establishing the sentence separation isn’t necessary (BERT, DistilBERT)? I guess I’m confused about the add_special_tokens=True option if sentence separation doesn’t matter.
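To make my confusion concrete, here is a minimal sketch of the two settings (the example text is made up):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
txt = "First sentence. Second sentence."

tokenizer.decode(tokenizer(txt, add_special_tokens=True).input_ids)   # returns "[CLS] first sentence. second sentence. [SEP]"
tokenizer.decode(tokenizer(txt, add_special_tokens=False).input_ids)  # returns "first sentence. second sentence."

So as far as I can tell, add_special_tokens=True only wraps the whole sequence in [CLS] … [SEP]; it does not add a [SEP] between the sentences.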