I want to get the character offset mapping of tokens when decoding model-generated ids, similar to `return_offsets_mapping` in `tokenizer.__call__`. But I cannot find out how to do it. For some tokenizers, I tried first decoding the generated ids to text and then encoding that text back to get the char offset mapping. However, `input_ids == tokenizer.encode(tokenizer.decode(input_ids))` is not always true.

`.token_to_chars` seems useful, but I’m not sure if it will work as intended…
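For reference, a rough sketch of the round-trip I tried (`gpt2` is just a placeholder checkpoint):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder checkpoint

# Pretend these ids came from model.generate()
generated_ids = tokenizer("Hello, world!")["input_ids"]

# Round trip: decode to text, then re-encode with offsets
decoded_text = tokenizer.decode(generated_ids)
re_encoded = tokenizer(decoded_text, return_offsets_mapping=True, add_special_tokens=False)

# The offsets only line up with the generated tokens if the round trip
# preserves the ids, which is not guaranteed in general
print(re_encoded["input_ids"] == generated_ids)
print(re_encoded["offset_mapping"])
```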
by HuggingChat:
To obtain the character offset mapping of tokens when decoding model-generated IDs, you can use the `token_to_chars` method available in many Hugging Face transformers tokenizers. This method allows you to retrieve the character start and end positions for each token in the original text. Here’s how you can implement this:
Step-by-Step Explanation and Solution:
1. Tokenize the Original Text with Offsets: When you initially tokenize your text, request the offset mappings to retain the character positions of each token.

   ```python
   from transformers import AutoTokenizer

   tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
   text = "Your text here"
   encoding = tokenizer(text, return_offsets_mapping=True)
   input_ids = encoding['input_ids']
   offsets = encoding['offset_mapping']
   ```

   This gives you `input_ids` and their corresponding character offsets.
2. Decode the Model-Generated Input IDs: Use the tokenizer’s `decode` method to convert the model’s output IDs back into text.

   ```python
   decoded_text = tokenizer.decode(input_ids)
   ```
3. Retrieve Character Offsets During Decoding: To get the character offsets for each token in the decoded text, utilize the `token_to_chars` method of the tokenizer.

   ```python
   # Example with the first few tokens
   token_indices = [0, 1, 2]  # Replace with your token indices
   for idx in token_indices:
       start, end = tokenizer.token_to_chars(idx, sequence=decoded_text)
       print(f"Token {idx} spans from {start} to {end}.")
   ```
Answer:
To get the character offset mappings when decoding model-generated IDs, use the `token_to_chars` method on your tokenizer instance. This method provides the character positions for each token in the decoded text without requiring re-encoding, thus preserving accuracy. For example:

```python
decoded_text = tokenizer.decode(input_ids)
for idx in range(len(input_ids)):
    start, end = tokenizer.token_to_chars(idx, sequence=decoded_text)
    print(f"Token {idx} spans from character {start} to {end}.")
```
This approach ensures you accurately track the character offsets for each token during the decoding process. [1][2]
This looks wrong. There is no method called `token_to_chars` on a tokenizer. It is a method of `BatchEncoding`. However, `BatchEncoding` is unavailable in my case, because I only have the ids generated by the model.
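For anyone who lands here: this is roughly how `token_to_chars` is actually used. It lives on the `BatchEncoding` returned by calling a fast tokenizer on text, which is exactly what you don’t have when you only hold generated ids (`bert-base-uncased` is just an example checkpoint):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # any fast tokenizer

encoding = tokenizer("Your text here")  # returns a BatchEncoding

# token_to_chars is defined on BatchEncoding, not on the tokenizer,
# and requires the encoding to come from a fast (Rust-backed) tokenizer
span = encoding.token_to_chars(1)  # skip index 0, which is the [CLS] special token
print(span.start, span.end)       # character span of token 1 in "Your text here"
```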
Sorry. It was a non-existent function…
None of these methods are very straightforward…
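One manual fallback (not an official API) is to decode successively longer prefixes of the generated ids and take the length differences as offsets. A rough sketch, assuming decoding is prefix-stable, which tokenizers that split multi-byte characters across tokens can break (`gpt2` and the hard-coded ids are just examples):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder fast tokenizer

generated_ids = [15496, 11, 995, 0]  # pretend output of model.generate()

# Decode growing prefixes; each token's span is the text added by its prefix
offsets, prev_len = [], 0
for i in range(1, len(generated_ids) + 1):
    prefix_text = tokenizer.decode(generated_ids[:i])
    offsets.append((prev_len, len(prefix_text)))
    prev_len = len(prefix_text)

decoded_text = tokenizer.decode(generated_ids)
for (start, end), tok_id in zip(offsets, generated_ids):
    print(tok_id, repr(decoded_text[start:end]))
```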