How to ensure the `overflow` with `stride` always starts with a full word?

nasheed · January 24, 2022, 3:35am

In continuation of my previous question: when I tokenize a sentence to `max_length` and keep the overflowing tokens, they are returned with a stride (overlap) on the previous segment, as in the code below:

tokenizer = SomeTokenizer.from_pretrained('some/path')
tokenizer('I am Nasheed and I like xylophones.', truncation=True, max_length=12, return_overflowing_tokens=True, stride=7)

The above snippet segments my original string as below:

Segment1: ‘I am Nasheed and I like xylo’
Segment2: ‘heed and I like xylophones.’

I want to ensure that the overflow segments I get always start with whole words as below*:

Segment1: ‘I am Nasheed and I like’
Segment2: ‘Nasheed and I like’
Segment3: ‘and I like xylophones.’

In Segment1 I have removed the partial word xylo; this can be done using what has been suggested here
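
One way to get the behaviour I am after might be to do the windowing myself instead of relying on `truncation` and `stride`. The sketch below is only an illustration under some assumptions: `word_aligned_windows` is a hypothetical helper (not part of any library), it expects one word index per token (fast tokenizers expose this through `BatchEncoding.word_ids()`, where special tokens show up as `None`, so I assume the text was encoded with `add_special_tokens=False`), and the token split in the demo is made up rather than taken from a real tokenizer. Because the start of each new window is moved forward to the next word start, the actual overlap can be smaller than `stride`.

```python
from typing import List, Tuple


def word_aligned_windows(
    word_ids: List[int], max_tokens: int, stride: int
) -> List[Tuple[int, int]]:
    """Return (start, end) token ranges, end exclusive, that never split a word.

    word_ids[i] is the index of the word that token i belongs to.
    """
    n = len(word_ids)
    # A token starts a word when its word id differs from the previous token's.
    is_word_start = [i == 0 or word_ids[i] != word_ids[i - 1] for i in range(n)]

    windows = []
    start = 0
    while start < n:
        end = min(start + max_tokens, n)
        # Shrink the window until the token right after it starts a new word.
        while end < n and not is_word_start[end]:
            end -= 1
        if end <= start:
            raise ValueError("a single word is longer than max_tokens")
        windows.append((start, end))
        if end == n:
            break
        # Step back by `stride` tokens for the overlap, then move forward to
        # the next word start so the new window opens on a whole word.
        next_start = max(end - stride, start + 1)
        while not is_word_start[next_start]:
            next_start += 1
        start = next_start
    return windows


# Hypothetical subword split of the example sentence (not real tokenizer output).
tokens = ["I", "am", "Nas", "heed", "and", "I", "like", "xylo", "phones", "."]
word_ids = [0, 1, 2, 2, 3, 4, 5, 6, 6, 6]

for start, end in word_aligned_windows(word_ids, max_tokens=8, stride=4):
    print(tokens[start:end])
```

With this made-up split, the first window drops the partial word xylo and the second window starts at "and" rather than at "heed". The resulting token ranges could then be used to slice the `input_ids`, with special tokens added back afterwards; `max_tokens` would need to leave room for them.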
