I want to match words in a text with the tokens created by Qwen's tokenizer based on their text spans. I can get the text spans of the tokens from the offset_mapping, but the Qwen3-Embedding-0.6B tokenizer adds the Ġ symbol for spaces, and that space is also included in the offset_mapping. Can I make the library omit the effect of the Ġ / space in the offset_mapping, or should I do it manually?
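To make the issue concrete, here is roughly what I am doing (a minimal sketch; the tokens and offsets in the comments are what I get for this sentence):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B")
text = "A girl is styling her hair."
enc = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['A', 'Ġgirl', 'Ġis', 'Ġstyling', 'Ġher', 'Ġhair', '.']
print(enc["offset_mapping"])
# [(0, 1), (1, 6), (6, 9), (9, 17), (17, 21), (21, 26), (26, 27)]
# e.g. (1, 6) covers " girl" including the leading space, because of the Ġ prefix.
```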
How about like this?
```python
from transformers import AutoTokenizer

model_ids = ["Qwen/Qwen3-Embedding-0.6B", "google-bert/bert-base-cased"]

for model_id in model_ids:
    # 1. Prepare tokenizer and text
    tokenizer = AutoTokenizer.from_pretrained(model_id, add_prefix_space=False)
    text = "A girl is styling her hair."

    # 2. Encode with offsets
    encoding = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
    tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])
    offsets = encoding["offset_mapping"]

    # 3. Adjust spans: skip the leading space for 'Ġ' tokens
    corrected_offsets = []
    for tok, (start, end) in zip(tokens, offsets):
        if tok.startswith("Ġ") and start < len(text) and text[start] == " ":
            start += 1
        corrected_offsets.append((start, end))

    print(model_id)
    print(tokens)
    print("Original:", offsets)
    print("Corrected:", corrected_offsets)

"""
Qwen/Qwen3-Embedding-0.6B
['A', 'Ġgirl', 'Ġis', 'Ġstyling', 'Ġher', 'Ġhair', '.']
Original: [(0, 1), (1, 6), (6, 9), (9, 17), (17, 21), (21, 26), (26, 27)]
Corrected: [(0, 1), (2, 6), (7, 9), (10, 17), (18, 21), (22, 26), (26, 27)]
google-bert/bert-base-cased
['A', 'girl', 'is', 'styling', 'her', 'hair', '.']
Original: [(0, 1), (2, 6), (7, 9), (10, 17), (18, 21), (22, 26), (26, 27)]
Corrected: [(0, 1), (2, 6), (7, 9), (10, 17), (18, 21), (22, 26), (26, 27)]
"""
```
@John6666 thank you for your reply.
Yes, if the text is clean, it can be done like this. But if, for example, there are multiple whitespaces within the text, there will be tokens that represent whitespace only, even after the first space has been removed. This is why I asked whether the transformers library has functionality to not include the Ġ character in tokenization, or whether it has to be done manually like in the answer you provided.
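For example, this is the kind of case I have in mind (a rough sketch; the exact way a run of spaces gets tokenized depends on the model's vocabulary, so I only flag whitespace-only spans here):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B")
text = "A girl  is styling her hair."  # note the double space before "is"
enc = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"])

for tok, (start, end) in zip(tokens, enc["offset_mapping"]):
    surface = text[start:end]
    # A Ġ-prefixed token can map to a span that is pure whitespace;
    # stripping "the first space" from such a span leaves nothing
    # sensible to align with a word.
    print(repr(tok), (start, end), repr(surface),
          "whitespace-only" if surface.strip() == "" else "")
```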
Looking at the GitHub issue above and old posts, it seems like there’s no way to do this…
It might be quicker to raise an issue in the Transformers repo…