I want to get the character offset mapping of tokens when decoding model-generated ids, similar to `return_offsets_mapping` in `tokenizer.__call__`. But I cannot find out how to do it. For some tokenizers, I tried first decoding the generated ids to text and then encoding that text back to get the char offset mapping. However, `input_ids == tokenizer.encode(tokenizer.decode(input_ids))` is not always true.

`.token_to_chars` seems useful, but I’m not sure if it will work as intended…
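For reference, a rough sketch of the round-trip I tried (`gpt2` is just a placeholder checkpoint):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder checkpoint

# Pretend these ids came from model.generate()
generated_ids = tokenizer("Hello, world!")["input_ids"]

# Round trip: decode to text, then re-encode with offsets
decoded_text = tokenizer.decode(generated_ids)
re_encoded = tokenizer(decoded_text, return_offsets_mapping=True, add_special_tokens=False)

# The offsets only line up with the generated tokens if the round trip
# preserves the ids, which is not guaranteed in general
print(re_encoded["input_ids"] == generated_ids)
print(re_encoded["offset_mapping"])
```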
by HuggingChat:
To obtain the character offset mapping of tokens when decoding model-generated IDs, you can use the `token_to_chars` method available in many Hugging Face transformers tokenizers. This method allows you to retrieve the character start and end positions for each token in the original text. Here’s how you can implement this:
Step-by-Step Explanation and Solution:
1. Tokenize the Original Text with Offsets: When you initially tokenize your text, request the offset mappings to retain the character positions of each token.

   ```python
   from transformers import AutoTokenizer

   tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
   text = "Your text here"
   encoding = tokenizer(text, return_offsets_mapping=True)
   input_ids = encoding['input_ids']
   offsets = encoding['offset_mapping']
   ```

   This gives you `input_ids` and their corresponding character offsets.
2. Decode the Model-Generated Input IDs: Use the tokenizer’s `decode` method to convert the model’s output IDs back into text.

   ```python
   decoded_text = tokenizer.decode(input_ids)
   ```
3. Retrieve Character Offsets During Decoding: To get the character offsets for each token in the decoded text, utilize the `token_to_chars` method of the tokenizer.

   ```python
   # Example with the first few tokens
   token_indices = [0, 1, 2]  # Replace with your token indices
   for idx in token_indices:
       start, end = tokenizer.token_to_chars(idx, sequence=decoded_text)
       print(f"Token {idx} spans from {start} to {end}.")
   ```
Answer:
To get the character offset mappings when decoding model-generated IDs, use the `token_to_chars` method on your tokenizer instance. This method provides the character positions for each token in the decoded text without requiring re-encoding, thus preserving accuracy. For example:

```python
decoded_text = tokenizer.decode(input_ids)
for idx in range(len(input_ids)):
    start, end = tokenizer.token_to_chars(idx, sequence=decoded_text)
    print(f"Token {idx} spans from character {start} to {end}.")
```
This approach ensures you accurately track the character offsets for each token during the decoding process. [1][2]
This looks wrong. There is no method called `token_to_chars` on a tokenizer. It is a method of `BatchEncoding`. However, `BatchEncoding` is unavailable in my case, because I only have the ids generated by the model.
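For anyone who lands here: this is roughly how `token_to_chars` is actually used. It lives on the `BatchEncoding` returned by calling a fast tokenizer on text, which is exactly what you don’t have when you only hold generated ids (`bert-base-uncased` is just an example checkpoint):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # any fast tokenizer

encoding = tokenizer("Your text here")  # returns a BatchEncoding

# token_to_chars is defined on BatchEncoding, not on the tokenizer,
# and requires the encoding to come from a fast (Rust-backed) tokenizer
span = encoding.token_to_chars(1)  # skip index 0, which is the [CLS] special token
print(span.start, span.end)       # character span of token 1 in "Your text here"
```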
Sorry. It was a non-existent function…
None of these methods are very straightforward…
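One manual fallback (not an official API) is to decode successively longer prefixes of the generated ids and take the length differences as offsets. A rough sketch, assuming decoding is prefix-stable, which tokenizers that split multi-byte characters across tokens can break (`gpt2` and the hard-coded ids are just examples):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder fast tokenizer

generated_ids = [15496, 11, 995, 0]  # pretend output of model.generate()

# Decode growing prefixes; each token's span is the text added by its prefix
offsets, prev_len = [], 0
for i in range(1, len(generated_ids) + 1):
    prefix_text = tokenizer.decode(generated_ids[:i])
    offsets.append((prev_len, len(prefix_text)))
    prev_len = len(prefix_text)

decoded_text = tokenizer.decode(generated_ids)
for (start, end), tok_id in zip(offsets, generated_ids):
    print(tok_id, repr(decoded_text[start:end]))
```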