Sentence splitting

I am following the Trainer example to fine-tune a BERT model on my data for text classification, using the pre-trained bert-base-uncased tokenizer.

In all examples I have found, the input texts are either single sentences or lists of sentences. However, my data is one string per document, comprising multiple sentences. When I inspect the tokenizer output, no [SEP] tokens are inserted between the sentences.

This is how I tokenize my dataset:

def encode(examples):
    # tokenize the raw document strings, truncating/padding to the model's max length
    return tokenizer(examples['text'], truncation=True, padding='max_length')


train_dataset = train_dataset.map(encode, batched=True)

And this is an example result of the tokenization:

tokenizer.decode(train_dataset[0]["input_ids"])

[CLS] this is the first sentence. this is the second sentence. [SEP]

Given the special tokens in the beginning and the end, and that the output is lower-cased, I see that the input has been tokenized as expected. However, I was expecting to see a [SEP] between each sentence, as is the case when the input comprises a list of sentences.
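For reference, here is a minimal example of the difference I mean (the sentences are just placeholders):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# a single string containing two sentences: only the [CLS]/[SEP] wrapping the whole sequence
single = tokenizer("This is the first sentence. This is the second sentence.")
tokenizer.decode(single.input_ids)
# "[CLS] this is the first sentence. this is the second sentence. [SEP]"

# a sentence pair (text, text_pair): a [SEP] is inserted between the two segments
pair = tokenizer("This is the first sentence.", "This is the second sentence.")
tokenizer.decode(pair.input_ids)
# "[CLS] this is the first sentence. [SEP] this is the second sentence. [SEP]"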

What is the recommended approach? Should I split the input documents into sentences, and run the tokenizer on each of them? Or does the Transformer model handle the continuous stream of sentences?

I have seen related posts on sentence splitting, but it is not clear to me whether this applies to a standard pipeline.

hey @carschno, how long are your documents (on average) and what kind of performance do you get with the current approach (i.e. tokenizing + truncating the whole document)?

if performance is a problem, then since you’re doing text classification you could try chunking each document into smaller passages with a sliding window (see this tutorial for details on how that’s done), and then aggregating the [CLS] representations for each window in a manner similar to this paper.

it’s not the most memory-efficient strategy, but if your documents are not super long it might be a viable alternative to simple truncation.
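here’s a rough sketch of the chunking step, assuming a bert-base-uncased tokenizer and a datasets Dataset with a text column (the max_length / stride values are arbitrary, just to show the idea):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def chunk(examples):
    # return_overflowing_tokens splits each document into overlapping windows,
    # and stride controls how many tokens consecutive windows share
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=256,
        stride=64,
        return_overflowing_tokens=True,
        padding="max_length",
    )

# each document can now map to several rows; the tokenizer also returns an
# overflow_to_sample_mapping field that tells you which document a window came from,
# which you'd use to copy the label to each window and to aggregate predictions later
chunked_dataset = train_dataset.map(chunk, batched=True, remove_columns=train_dataset.column_names)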

Hi @lewtun,
Thanks for the suggestion, I think the sliding window approach looks very promising indeed.
My documents are body texts from scraped web pages, so the length varies widely, from a few tokens to very long texts.
Current performance of the fine-tuned classifier is around 0.8-0.9 accuracy (2 classes). I’ll have to look deeper into whether this is good enough for my application, and do more data analysis.

Anyway, for clarification: I understand that the tokenizer behavior I described above is expected, and that the BERT model is supposed to handle input texts with multiple sentences in a single string well, right?


Yes, you’re totally right :slight_smile:. From the tokenizer’s perspective, it doesn’t matter if the input string is composed of one or more sentences - it will split it into words/subwords according to the underlying tokenization algorithm (WordPiece in BERT’s case). If you want to see the tokens directly, you can use the tokenizer’s convert_ids_to_tokens function on the input_ids returned by the tokenizer.

I actually used the tokenizer’s decode method on the tokenized dataset:

tokenizer.decode(train_dataset[0]["input_ids"])

[CLS] <...> [SEP] [SEP] [PAD] [PAD] <...>

I suppose both are the same.

they’re similar, but convert_ids_to_tokens lets you see the subwords, e.g.:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
txt = "The Higgs field gives mass to subatomic particles"
txt_enc = tokenizer(txt)

tokenizer.convert_ids_to_tokens(txt_enc.input_ids) # returns ['[CLS]', 'the', 'hi', '##ggs', 'field', 'gives', 'mass', 'to', 'sub', '##ato', '##mic', 'particles', '[SEP]']
tokenizer.decode(txt_enc.input_ids) # returns "[CLS] the higgs field gives mass to subatomic particles [SEP]"

the decode method combines the subwords into a single string and adds the special tokens, as in your example. I find both methods quite handy for debugging!


From the tokenizer’s perspective, it doesn’t matter if the input string is composed of one or more sentences - it will split it into words/subwords according to the underlying tokenization algorithm (WordPiece in BERT’s case).

Am I correct that if my data doesn’t exceed max_seq_length, but each observation consists of two or more sentences, then explicit sentence separation isn’t necessary (BERT, DistilBERT)? I guess I’m confused about the add_special_tokens=True option if sentence separation doesn’t matter.
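For example, this is what I think add_special_tokens controls (just a sanity check with a made-up sentence):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
text = "First sentence. Second sentence."

# add_special_tokens=True (the default) only wraps the whole sequence in [CLS] ... [SEP];
# it does not insert a [SEP] between sentences inside a single string
tokenizer.decode(tokenizer(text, add_special_tokens=True).input_ids)
# "[CLS] first sentence. second sentence. [SEP]"

tokenizer.decode(tokenizer(text, add_special_tokens=False).input_ids)
# "first sentence. second sentence."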

Sorry for necro posting, but I just wanted to point out that pySBD is a great library for sentence tokenization. Be sure to set clean=True when using the Segmenter class.
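A minimal sketch of what that looks like (assuming the pysbd package from PyPI):

import pysbd

# clean=True normalizes the text (e.g. stray newlines) before segmenting
seg = pysbd.Segmenter(language="en", clean=True)
seg.segment("This is the first sentence. This is the second sentence.")
# ['This is the first sentence.', 'This is the second sentence.']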
