Sentence splitting

I am following the Trainer example to fine-tune a BERT model on my data for text classification, using the pre-trained bert-base-uncased tokenizer.

In all examples I have found, the input texts are either single sentences or lists of sentences. However, my data is one string per document, comprising multiple sentences. When I inspect the tokenizer output, no [SEP] tokens are inserted between the sentences.

This is how I tokenize my dataset:

def encode(examples):
    # tokenize the raw document strings, truncating/padding to the model's max length
    return tokenizer(examples['text'], truncation=True, padding='max_length')


train_dataset = train_dataset.map(encode, batched=True)

And this is an example result of the tokenization:

tokenizer.decode(train_dataset[0]["input_ids"])

[CLS] this is the first sentence. this is the second sentence. [SEP]

Given the special tokens in the beginning and the end, and that the output is lower-cased, I see that the input has been tokenized as expected. However, I was expecting to see a [SEP] between each sentence, as is the case when the input comprises a list of sentences.
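For reference, here is a minimal example of the difference I mean (the sentences are just placeholders):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# a single string containing two sentences: only the [CLS]/[SEP] wrapping the whole sequence
single = tokenizer("This is the first sentence. This is the second sentence.")
tokenizer.decode(single.input_ids)
# "[CLS] this is the first sentence. this is the second sentence. [SEP]"

# a sentence pair (text, text_pair): a [SEP] is inserted between the two segments
pair = tokenizer("This is the first sentence.", "This is the second sentence.")
tokenizer.decode(pair.input_ids)
# "[CLS] this is the first sentence. [SEP] this is the second sentence. [SEP]"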

What is the recommended approach? Should I split the input documents into sentences, and run the tokenizer on each of them? Or does the Transformer model handle the continuous stream of sentences?

I have seen related posts on sentence splitting, but it is not clear to me whether this applies to a standard pipeline.

hey @carschno, how long are your documents (on average) and what kind of performance do you get with the current approach (i.e. tokenizing + truncating the whole document)?

if performance is a problem, then since you’re doing text classification you could try chunking each document into smaller passages with a sliding window (see this tutorial for details on how that’s done), and then aggregating the [CLS] representations for each window in a manner similar to this paper.

it’s not the most memory-efficient strategy, but if your documents are not super long it might be a viable alternative to simple truncation.
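here’s a rough sketch of the chunking step, assuming a bert-base-uncased tokenizer and a datasets Dataset with a text column (the max_length / stride values are arbitrary, just to show the idea):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def chunk(examples):
    # return_overflowing_tokens splits each document into overlapping windows,
    # and stride controls how many tokens consecutive windows share
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=256,
        stride=64,
        return_overflowing_tokens=True,
        padding="max_length",
    )

# each document can now map to several rows; the tokenizer also returns an
# overflow_to_sample_mapping field that tells you which document a window came from,
# which you'd use to copy the label to each window and to aggregate predictions later
chunked_dataset = train_dataset.map(chunk, batched=True, remove_columns=train_dataset.column_names)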

Hi @lewtun,
Thanks for the suggestion, I think the sliding window approach looks very promising indeed.
My documents are body texts from scraped web pages, so the length varies widely, from a few tokens to very long texts.
Current performance of the fine-tuned classifier is around 0.8-0.9 accuracy (2 classes). I’ll have to look deeper into whether this is good enough for my application, and do more data analysis.

Anyway, for clarification: I understand that the tokenizer behavior I described above is expected, and that the BERT model is supposed to handle input texts with multiple sentences in a single string well, right?


Yes, you’re totally right :slight_smile:. From the tokenizer’s perspective, it doesn’t matter if the input string is composed of one or more sentences - it will split it into words/subwords according to the underlying tokenization algorithm (WordPiece in BERT’s case). If you want to see the tokens directly, you can use the tokenizer’s convert_ids_to_tokens function on the input_ids returned by the tokenizer.

I actually used the tokenizer’s decode method on the tokenized dataset:

tokenizer.decode(train_dataset[0]["input_ids"])

[CLS] <...> [SEP] [SEP] [PAD] [PAD] <...>

I suppose both are the same.

they’re similar, but convert_ids_to_tokens lets you see the subwords, e.g.:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
txt = "The Higgs field gives mass to subatomic particles"
txt_enc = tokenizer(txt)

tokenizer.convert_ids_to_tokens(txt_enc.input_ids) # returns ['[CLS]', 'the', 'hi', '##ggs', 'field', 'gives', 'mass', 'to', 'sub', '##ato', '##mic', 'particles', '[SEP]']
tokenizer.decode(txt_enc.input_ids) # returns "[CLS] the higgs field gives mass to subatomic particles [SEP]"

the decode method combines the subwords into a single string and adds the special tokens, as in your example. I find both methods quite handy for debugging!


From the tokenizer’s perspective, it doesn’t matter if the input string is composed of one or more sentences - it will split it into words/subwords according to the underlying tokenization algorithm (WordPiece in BERT’s case).

Am I correct that if my data doesn’t exceed max_seq_length, but each observation consists of two or more sentences, then explicit sentence separation isn’t necessary (BERT, DistilBERT)? I guess I’m confused about the add_special_tokens=True option if sentence separation doesn’t matter.
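For example, this is what I think add_special_tokens controls (just a sanity check with a made-up sentence):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
text = "First sentence. Second sentence."

# add_special_tokens=True (the default) only wraps the whole sequence in [CLS] ... [SEP];
# it does not insert a [SEP] between sentences inside a single string
tokenizer.decode(tokenizer(text, add_special_tokens=True).input_ids)
# "[CLS] first sentence. second sentence. [SEP]"

tokenizer.decode(tokenizer(text, add_special_tokens=False).input_ids)
# "first sentence. second sentence."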

Sorry for necro posting, but I just wanted to point out that pySBD is a great library for sentence tokenization. Be sure to set clean=True when using the Segmenter class.
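A minimal sketch of what that looks like (assuming the pysbd package from PyPI):

import pysbd

# clean=True normalizes the text (e.g. stray newlines) before segmenting
seg = pysbd.Segmenter(language="en", clean=True)
seg.segment("This is the first sentence. This is the second sentence.")
# ['This is the first sentence.', 'This is the second sentence.']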
