I want to match words in a text with the tokens created by Qwen's tokenizer based on their text spans. I can get the text spans of the tokens from the offset_mapping, but the Qwen3-Embedding-0.6B tokenizer adds the Ġ symbol for spaces, and that space is also included in the offset_mapping. Can I make the library omit the effect of the Ġ / space in the offset_mapping, or should I do it manually?
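To make the issue concrete, here is roughly what I am doing (a minimal sketch; the tokens and offsets in the comments are what I get for this sentence):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B")
text = "A girl is styling her hair."
enc = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['A', 'Ġgirl', 'Ġis', 'Ġstyling', 'Ġher', 'Ġhair', '.']
print(enc["offset_mapping"])
# [(0, 1), (1, 6), (6, 9), (9, 17), (17, 21), (21, 26), (26, 27)]
# e.g. (1, 6) covers " girl" including the leading space, because of the Ġ prefix.
```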
How about like this?
```python
from transformers import AutoTokenizer

model_ids = ["Qwen/Qwen3-Embedding-0.6B", "google-bert/bert-base-cased"]

for model_id in model_ids:
    # 1. Prepare tokenizer and text
    tokenizer = AutoTokenizer.from_pretrained(model_id, add_prefix_space=False)
    text = "A girl is styling her hair."

    # 2. Encode with offsets
    encoding = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
    tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])
    offsets = encoding["offset_mapping"]

    # 3. Adjust spans: skip the leading space for 'Ġ' tokens
    corrected_offsets = []
    for tok, (start, end) in zip(tokens, offsets):
        if tok.startswith("Ġ") and start < len(text) and text[start] == " ":
            start += 1
        corrected_offsets.append((start, end))

    print(model_id)
    print(tokens)
    print("Original:", offsets)
    print("Corrected:", corrected_offsets)

"""
Qwen/Qwen3-Embedding-0.6B
['A', 'Ġgirl', 'Ġis', 'Ġstyling', 'Ġher', 'Ġhair', '.']
Original: [(0, 1), (1, 6), (6, 9), (9, 17), (17, 21), (21, 26), (26, 27)]
Corrected: [(0, 1), (2, 6), (7, 9), (10, 17), (18, 21), (22, 26), (26, 27)]
google-bert/bert-base-cased
['A', 'girl', 'is', 'styling', 'her', 'hair', '.']
Original: [(0, 1), (2, 6), (7, 9), (10, 17), (18, 21), (22, 26), (26, 27)]
Corrected: [(0, 1), (2, 6), (7, 9), (10, 17), (18, 21), (22, 26), (26, 27)]
"""
```
@John6666 thank you for your reply.
Yes, if the text is clean, it can be done like this. But if, for example, there are multiple whitespaces within the text, there will be tokens that represent whitespace only, even after the first space has been removed. This is why I asked whether the transformers library has functionality to not include the Ġ character in tokenization, or whether it has to be done manually like in the answer you provided.
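For example, this is the kind of case I have in mind (a rough sketch; the exact way a run of spaces gets tokenized depends on the model's vocabulary, so I only flag whitespace-only spans here):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B")
text = "A girl  is styling her hair."  # note the double space before "is"
enc = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"])

for tok, (start, end) in zip(tokens, enc["offset_mapping"]):
    surface = text[start:end]
    # A Ġ-prefixed token can map to a span that is pure whitespace;
    # stripping "the first space" from such a span leaves nothing
    # sensible to align with a word.
    print(repr(tok), (start, end), repr(surface),
          "whitespace-only" if surface.strip() == "" else "")
```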
Looking at the GitHub issue above and old posts, it seems like there’s no way to do this…
It might be quicker to raise an issue in the Transformers repo…