I have a long text sample that I'm encoding into fixed-size windows using `return_overflowing_tokens=True` with a fixed `max_length`.
This works fine, but occasionally I have a very long sample that I want to truncate. For example, a 20k-token sample with a 1k `max_length` gives me 20 windows, which is fine; but for a 1M-token sample with the same 1k `max_length` (per window), I only want the first 30 windows, i.e. 30k tokens. There is no need to tokenize all of the data.
Currently, as a workaround, I first run the tokenizer with
`truncation=True, return_offsets_mapping=True, max_length=30_000`,
truncate the raw text at the character position given by the last entry in `offset_mapping`,
and then tokenize the shortened text a second time with
`return_overflowing_tokens=True` to get the windows.
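For reference, the offset-based truncation step of this two-pass workaround can be sketched as below. The toy whitespace tokenizer is only a stand-in for a real tokenizer called with `return_offsets_mapping=True`, and `truncate_by_offsets` is a hypothetical helper name, not a library function:

```python
def toy_offsets(text):
    # Return (start_char, end_char) per token, mimicking the shape of
    # the offset_mapping a real tokenizer would return.
    offsets, start = [], 0
    for i, ch in enumerate(text + " "):
        if ch == " ":
            if i > start:
                offsets.append((start, i))
            start = i + 1
    return offsets

def truncate_by_offsets(text, offsets, max_tokens):
    # Cut the raw text where token number max_tokens ends, so the
    # second tokenization pass only sees the kept prefix.
    if len(offsets) <= max_tokens:
        return text
    return text[: offsets[max_tokens - 1][1]]

text = "one two three four five"
short = truncate_by_offsets(text, toy_offsets(text), 3)
print(short)  # -> "one two three"
# In the real flow, `short` would then be tokenized again with
# return_overflowing_tokens=True to produce the windows.
```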
Is there a way to avoid tokenizing twice?