Following up on my previous question:
I tokenize a sentence to max_length
while keeping the overflowing tokens, which are returned with a stride
(overlap) on the previous segment, as in the code below:
tokenizer = SomeTokenizer.from_pretrained('some/path')
tokenizer('I am Nasheed and I like xylophones.', truncation=True, max_length=12, return_overflowing_tokens=True, stride=7)
The above snippet segments my original string as below:
Segment1: ‘I am Nasheed and I like xylo’
Segment2: ‘heed and I like xylophones.’
I want to ensure that the overflow segments I get always start with whole words as below*:
Segment1: ‘I am Nasheed and I like’
Segment2: ‘Nasheed and I like’
Segment3: ‘and I like xylophones.’
- In Segment1 I have removed the partial word xylo; this can be done using what has been suggested here.
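To make the goal concrete, here is a minimal sketch of the boundary logic I have in mind. It is not something the tokenizer does for me. It assumes I already have one word index per token, such as the list returned by `word_ids()` on the encoding of a fast tokenizer (encoded without special tokens). The helper name `whole_word_windows` and the subword split in the example are hypothetical, and the sketch is not meant to reproduce the exact segments listed above, only the whole-word constraint.

```python
from typing import List, Optional, Tuple


def whole_word_windows(
    word_ids: List[Optional[int]], max_length: int, stride: int
) -> List[Tuple[int, int]]:
    """Return (start, end) token spans that begin and end on word boundaries.

    word_ids holds one word index per token (assumed to come from a fast
    tokenizer's word_ids(), with special tokens left out).
    """
    n = len(word_ids)

    def is_word_start(i: int) -> bool:
        # Position i is a boundary if it is an edge or the word index changes.
        return i == 0 or i == n or word_ids[i] != word_ids[i - 1]

    spans = []
    start = 0
    while start < n:
        end = min(start + max_length, n)
        # Shrink the window until it no longer cuts a word in half.
        while end > start and not is_word_start(end):
            end -= 1
        if end == start:
            # A single word longer than max_length: fall back to a hard cut.
            end = min(start + max_length, n)
        spans.append((start, end))
        if end == n:
            break
        # Step back by `stride` tokens, then advance to the next word start
        # (never past `end`, so no token is skipped; always past `start`).
        nxt = max(end - stride, start + 1)
        while nxt < end and not is_word_start(nxt):
            nxt += 1
        start = nxt
    return spans


# Hypothetical split: I | am | Nas heed | and | I | like | xylo phones | .
example_word_ids = [0, 1, 2, 2, 3, 4, 5, 6, 6, 7]
print(whole_word_windows(example_word_ids, max_length=8, stride=4))
# → [(0, 7), (4, 10)]
```

With this hypothetical split, the first window drops the dangling "xylo" and the second starts at "and" rather than at "heed", so the effective overlap can be shorter than the requested stride. The spans would then be used to slice `input_ids` manually.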