Qwen tokenizer, omit Ġ from offset_mapping

I want to match words in a text with the tokens created by Qwen's tokenizer based on their text spans. I can get the tokens' text spans from the offset_mapping, but the Qwen3-Embedding-0.6B tokenizer adds the Ġ symbol for a space, and that space is also included in the offset_mapping. Can I omit the effect of Ġ/space in the offset_mapping through the library, or should I do it manually?


How about like this?

from transformers import AutoTokenizer

model_ids = ["Qwen/Qwen3-Embedding-0.6B", "google-bert/bert-base-cased"]

for model_id in model_ids:
    # 1. Prepare tokenizer and text
    tokenizer = AutoTokenizer.from_pretrained(model_id, add_prefix_space=False)
    text      = "A girl is styling her hair."

    # 2. Encode with offsets
    encoding  = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
    tokens    = tokenizer.convert_ids_to_tokens(encoding["input_ids"])
    offsets   = encoding["offset_mapping"]

    # 3. Adjust spans: skip the leading space for 'Ġ' tokens
    corrected_offsets = []
    for tok, (start, end) in zip(tokens, offsets):
        if tok.startswith("Ġ") and start < len(text) and text[start] == " ":
            start += 1
        corrected_offsets.append((start, end))

    print(model_id)
    print(tokens)
    print("Original:", offsets)
    print("Corrected:", corrected_offsets)

"""
Qwen/Qwen3-Embedding-0.6B
['A', 'Ġgirl', 'Ġis', 'Ġstyling', 'Ġher', 'Ġhair', '.']
Original: [(0, 1), (1, 6), (6, 9), (9, 17), (17, 21), (21, 26), (26, 27)]
Corrected: [(0, 1), (2, 6), (7, 9), (10, 17), (18, 21), (22, 26), (26, 27)]
google-bert/bert-base-cased
['A', 'girl', 'is', 'styling', 'her', 'hair', '.']
Original: [(0, 1), (2, 6), (7, 9), (10, 17), (18, 21), (22, 26), (26, 27)]
Corrected: [(0, 1), (2, 6), (7, 9), (10, 17), (18, 21), (22, 26), (26, 27)]
"""

@John6666 thank you for your reply.

Yes, if the text is clean, it can be done like this. But if, for example, there are multiple consecutive whitespaces in the text, there will be tokens that represent whitespace only, and only the first space gets removed. That is why I asked whether the transformers library has functionality to exclude the Ġ character from the offsets, or whether it should be done manually as in the answer you provided.
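For the multiple-whitespace case, one manual workaround is to clamp each span against the text itself instead of checking for a single Ġ: skip every leading whitespace character inside the span, so a whitespace-only token collapses to an empty span. A minimal sketch (the function name and the hardcoded offsets, taken from the printout above, are just for illustration):

```python
def strip_leading_space(text, offsets):
    """Shrink each (start, end) span so it no longer covers leading whitespace.

    Handles any number of consecutive spaces, unlike checking only for a
    'Ġ' prefix; a whitespace-only token collapses to an empty span
    (start == end), which is easy to filter out afterwards.
    """
    fixed = []
    for start, end in offsets:
        # advance past all leading whitespace within the span
        while start < end and text[start].isspace():
            start += 1
        fixed.append((start, end))
    return fixed

# Offsets as printed above for the Qwen tokenizer
text = "A girl is styling her hair."
offsets = [(0, 1), (1, 6), (6, 9), (9, 17), (17, 21), (21, 26), (26, 27)]
print(strip_leading_space(text, offsets))
# [(0, 1), (2, 6), (7, 9), (10, 17), (18, 21), (22, 26), (26, 27)]
```

This is still a manual fix, but it does not break when a token's span contains more than one space.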


Looking at the GitHub issue above and older posts, it seems there's no built-in way to do this…
It might be quicker to raise an issue in the Transformers repo…
