I have a long text sample that I'm encoding into fixed-size windows using `return_overflowing_tokens=True` with a fixed `max_length`.
This works fine, but occasionally I have a very long sample that I want to truncate. For example, a 20k-token sample with a 1k `max_length` gives me 20 windows, which is fine; but for a 1M-token sample with the same 1k `max_length` (per window), I only want the first 30 windows, i.e. 30k tokens. There is no need to tokenize all of the data.
Currently, as a workaround, I first run the tokenizer with
`truncation=True, return_offsets_mapping=True, max_length=30_000`,
truncate the raw text at the character position given by the last entry in `offset_mapping`,
and then tokenize the shortened text a second time with
`return_overflowing_tokens=True` to get the windows.
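For reference, the offset-based truncation step of this two-pass workaround can be sketched as below. The toy whitespace tokenizer is only a stand-in for a real tokenizer called with `return_offsets_mapping=True`, and `truncate_by_offsets` is a hypothetical helper name, not a library function:

```python
def toy_offsets(text):
    # Return (start_char, end_char) per token, mimicking the shape of
    # the offset_mapping a real tokenizer would return.
    offsets, start = [], 0
    for i, ch in enumerate(text + " "):
        if ch == " ":
            if i > start:
                offsets.append((start, i))
            start = i + 1
    return offsets

def truncate_by_offsets(text, offsets, max_tokens):
    # Cut the raw text where token number max_tokens ends, so the
    # second tokenization pass only sees the kept prefix.
    if len(offsets) <= max_tokens:
        return text
    return text[: offsets[max_tokens - 1][1]]

text = "one two three four five"
short = truncate_by_offsets(text, toy_offsets(text), 3)
print(short)  # -> "one two three"
# In the real flow, `short` would then be tokenized again with
# return_overflowing_tokens=True to produce the windows.
```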
Is there a way to avoid tokenizing twice?