Hello everyone,
I am working with CamemBERT and I am facing a strange issue with the tokenizer.
Here is my setup:
- I am using CamembertTokenizerFast from Hugging Face.
- I first perform a pre-tokenization step with the tokenizer (to store tokens or prepare data).
- Later, I try to use these tokenized files for a classification task.
The problem:
The special marker that SentencePiece uses to mark the beginning of a word (▁) is not encoded the same way in my pre-tokenized files as the model expects during training.
As a result, when I reuse my pre-tokenized files, the model does not recognize the word boundaries properly (the "▁" is not treated as expected).
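A minimal repro of what I mean (the sentence is just an example; any multi-word text shows the same behavior):

# Minimal repro of the mismatch (a sketch; "camembert-base" assumed available)
from transformers import CamembertTokenizerFast

tok = CamembertTokenizerFast.from_pretrained("camembert-base")
text = "Le chat dort."

ids_direct = tok(text)["input_ids"]        # what the model expects at training time
pieces = tok.tokenize(text)                # the pieces I store during pre-tokenization
ids_reloaded = tok(" ".join(pieces))["input_ids"]  # naive reuse: pieces fed back as text

print(pieces)                     # the first piece of each word carries '▁'
print(ids_direct == ids_reloaded) # False: the stored pieces get re-tokenized, boundaries shift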
With a SentencePiece tokenizer, it seems the simplest approach is to keep the raw text without pre-tokenizing and to tokenize it just before use (see the sketch after the code below)…
# pip install -U transformers
# https://github.com/huggingface/transformers/issues/5087
# https://github.com/huggingface/transformers/issues/12308
# https://discuss.huggingface.co/t/added-tokens-not-decoding-with-spaces/10883

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("camembert-base")

# A no-space string that splits into multiple subpieces
text = "anticonstitutionnellement"

# Baseline: raw text → ids
ids_raw = tok(text, add_special_tokens=True)["input_ids"]

# Precompute subword strings (first piece starts with '▁', suffix pieces don't)
pieces = tok.tokenize(text)
assert any(not p.startswith("▁") for p in pieces), pieces  # ensure true subwording

# ✅ Correct reuse of pre-tokenized subword STRINGS: map directly to ids, then add specials
ids_right = tok.build_inputs_with_special_tokens(tok.convert_tokens_to_ids(pieces))

# ❌ Wrong 1: re-tokenize subword strings as if they were text (adds spaces, changes boundaries)
ids_wrong_text = tok(" ".join(pieces), add_special_tokens=True)["input_ids"]

# ❌ Wrong 2: treat subword strings as "words" (this flag expects words; it will re-tokenize)
ids_wrong_words = tok(pieces, is_split_into_words=True, add_special_tokens=True)["input_ids"]

print("pieces:", pieces)                                   # pieces: ['▁anti', 'c', 'onstitutionnelle', 'ment']
print("raw == right ?", ids_raw == ids_right)              # raw == right ? True
print("raw == wrong_text ?", ids_raw == ids_wrong_text)    # raw == wrong_text ? False
print("raw == wrong_words?", ids_raw == ids_wrong_words)   # raw == wrong_words? False