I have a long text sample that I'm encoding into fixed-size windows using `return_overflowing_tokens=True` with a fixed `max_length`.
This works fine, but occasionally I have a very long sample that I want to truncate. For example, a 20k-token sample with a 1k `max_length` gives me 20 windows, which is fine; but for a 1M-token sample with the same 1k `max_length` (per window), I only want the first 30 windows, i.e. 30k tokens. There is no need to tokenize all of the data.
Currently, as a workaround, I first run the tokenizer with
`truncation=True, return_offsets_mapping=True, max_length=30_000`,
truncate the raw text at the character position given by the last entry in `offset_mapping`,
and then tokenize the shortened text a second time with
`return_overflowing_tokens=True` to get the windows.
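For reference, the offset-based truncation step of this two-pass workaround can be sketched as below. The toy whitespace tokenizer is only a stand-in for a real tokenizer called with `return_offsets_mapping=True`, and `truncate_by_offsets` is a hypothetical helper name, not a library function:

```python
def toy_offsets(text):
    # Return (start_char, end_char) per token, mimicking the shape of
    # the offset_mapping a real tokenizer would return.
    offsets, start = [], 0
    for i, ch in enumerate(text + " "):
        if ch == " ":
            if i > start:
                offsets.append((start, i))
            start = i + 1
    return offsets

def truncate_by_offsets(text, offsets, max_tokens):
    # Cut the raw text where token number max_tokens ends, so the
    # second tokenization pass only sees the kept prefix.
    if len(offsets) <= max_tokens:
        return text
    return text[: offsets[max_tokens - 1][1]]

text = "one two three four five"
short = truncate_by_offsets(text, toy_offsets(text), 3)
print(short)  # -> "one two three"
# In the real flow, `short` would then be tokenized again with
# return_overflowing_tokens=True to produce the windows.
```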
Is there a way to avoid tokenizing twice?