Best way to get the closest token indices when the input of char_to_token is a whitespace

SantoshGupta · February 19, 2023, 3:44am

In many text datasets, character spans often start or end at a whitespace character. If you pass such an index to char_to_token, it returns None, since the whitespace does not correspond to any token.

Is there a way to get the closest token on either side of the whitespace? Or an indicator that the character span is out of range (for example, when chunking and striding, the character span may not be in the overflow tokens)?
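To make the question concrete, here is a rough sketch of the behaviour I'm after, written against an offset mapping (the list of `(start, end)` character offsets that fast tokenizers return with `return_offsets_mapping=True`). The function name `nearest_tokens` and the toy offsets are made up for illustration, and it assumes a single sequence whose offsets are sorted, with special tokens reported as `(0, 0)`.

```python
from typing import List, Optional, Tuple


def nearest_tokens(
    offsets: List[Tuple[int, int]], char_index: int
) -> Tuple[Optional[int], Optional[int]]:
    """Return (left, right) token indices around char_index.

    If char_index falls inside a token, both values are that token.
    Otherwise left is the last token ending at or before char_index and
    right is the first token starting after it. A None on either side
    means there is no token on that side, e.g. the character lies
    outside the current chunk.
    """
    left: Optional[int] = None
    right: Optional[int] = None
    for tok, (start, end) in enumerate(offsets):
        if start == end:
            continue  # assumption: empty spans are special/padding tokens
        if start <= char_index < end:
            return tok, tok
        if end <= char_index:
            left = tok  # offsets are sorted, so the last match is the closest
        elif right is None:
            right = tok  # first token that starts after char_index
    return left, right


# Toy offsets for "hello world foo" with special tokens at both ends.
offsets = [(0, 0), (0, 5), (6, 11), (12, 15), (0, 0)]
print(nearest_tokens(offsets, 5))   # whitespace between "hello" and "world"
print(nearest_tokens(offsets, 7))   # inside "world"
print(nearest_tokens(offsets, 20))  # past the end of the text
```

A `None` on one side would serve as the out-of-range indicator: for an overflow chunk that only covers part of the text, a character before the chunk gives `(None, first_token)` and one after it gives `(last_token, None)`.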

I looked through the source code to see if there’s some variable in the class or function that I can play around with:

github.com

huggingface/transformers/blob/v4.26.1/src/transformers/tokenization_utils_base.py#L537


      
            raise ValueError("token_to_chars() is not available when using Python based tokenizers")
        if token_index is not None:
            batch_index = batch_or_token_index
        else:
            batch_index = 0
            token_index = batch_or_token_index
        span_indices = self._encodings[batch_index].token_to_chars(token_index)

        return CharSpan(*span_indices) if span_indices is not None else None

    def char_to_token(
        self, batch_or_char_index: int, char_index: Optional[int] = None, sequence_index: int = 0
    ) -> int:
        """
        Get the index of the token in the encoded output comprising a character in the original string for a sequence
        of the batch.

        Can be called as:

        - `self.char_to_token(char_index)` if batch size is 1
        - `self.char_to_token(batch_index, char_index)` if batch size is greater or equal to 1

But it seems that the actual lookup happens in the Rust part of the tokenizers library.

My workaround is to convert every character in the span to a token index and then take the min and max.
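Roughly, the workaround looks like the sketch below. The helper name `span_to_token_range` is made up for illustration, and `toy_char_to_token` is a stand-in so the snippet runs without a tokenizer. With a real fast tokenizer I would pass something like `lambda i: encoding.char_to_token(i)` instead, where `encoding` is the tokenizer output.

```python
from typing import Callable, Optional, Tuple


def span_to_token_range(
    char_to_token: Callable[[int], Optional[int]],
    char_start: int,
    char_end: int,
) -> Optional[Tuple[int, int]]:
    """Map the character span [char_start, char_end) to (first_token, last_token).

    Characters that map to no token (whitespace, or characters outside the
    current chunk) are skipped. Returns None if no character in the span
    maps to a token.
    """
    token_ids = [
        tok
        for tok in (char_to_token(i) for i in range(char_start, char_end))
        if tok is not None
    ]
    if not token_ids:
        return None
    return min(token_ids), max(token_ids)


# Toy stand-in for char_to_token on the text "hello world foo".
toy_offsets = [(0, 5), (6, 11), (12, 15)]


def toy_char_to_token(char_index: int) -> Optional[int]:
    for tok, (start, end) in enumerate(toy_offsets):
        if start <= char_index < end:
            return tok
    return None


# The span " world " (characters 5 to 12) starts and ends on whitespace.
print(span_to_token_range(toy_char_to_token, 5, 12))
```

This costs one lookup per character in the span, which is fine for short answer spans but wasteful for long ones. A `None` result also doubles as a sign that the whole span fell outside the chunk.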
