In many text datasets, character spans start or end on a whitespace character. If you pass such a position to char_to_token, it returns None, since the whitespace does not correspond to any token.
Is there a way to get the closest token on either side of the whitespace? Or an indicator that the character span is out of range, for example when chunking with a stride leaves the character span outside the tokens of the current overflow chunk?
I looked through the source code for a variable in the class or function that I could play around with,
but the call seems to go into the Rust part of the tokenizers library.
My workaround is to convert every character in the span to a token index and then take the min and max of those indices.
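For reference, here is a minimal sketch of that workaround. The helper name `char_span_to_token_span` is made up for illustration, and the sketch assumes that the `char_to_token` callable you pass in returns None for any character that has no token (whitespace, or a position outside the current chunk). It skips those None values before taking the min and max, and returns None when nothing in the span maps to a token, which can double as a rough "out of range" indicator.

```python
from typing import Callable, Optional, Tuple


def char_span_to_token_span(
    char_to_token: Callable[[int], Optional[int]],
    char_start: int,
    char_end: int,
) -> Optional[Tuple[int, int]]:
    """Map the character span [char_start, char_end) to an inclusive
    (first_token, last_token) pair.

    Hypothetical helper, not part of any library. `char_to_token` is any
    callable that maps a character index to a token index or None.
    """
    # Look up every character in the span and drop the ones without a token
    # (whitespace, or characters that fall outside the current chunk).
    token_indices = [
        token_index
        for token_index in (char_to_token(i) for i in range(char_start, char_end))
        if token_index is not None
    ]
    if not token_indices:
        # No character in the span maps to a token.
        return None
    return min(token_indices), max(token_indices)
```

With a fast tokenizer this would presumably be called as `char_span_to_token_span(encoding.char_to_token, start, end)`. For overflowing chunks, my assumption is that something like `lambda i: encoding.char_to_token(chunk_index, i)` could be passed instead, where `chunk_index` selects the chunk; I have not verified how that behaves for every tokenizer or for sequence pairs.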