Issue with CamemBERT tokenizer – inconsistency with subword prefix (▁) between pre-tokenization and training

Hello everyone,

I am working with CamemBERT and I am facing a strange issue with the tokenizer.

Here is my setup:

  • I am using CamembertTokenizerFast from Hugging Face.

  • I first perform a pre-tokenization step with the tokenizer (to store tokens or prepare data).

  • Later, I try to use these tokenized files for a classification task.

👉 The problem:
The special marker that SentencePiece uses to flag the start of a word (the "▁" prefix) is not encoded the same way during my pre-tokenization step as it is when the model tokenizes text during training.
As a result, when I reuse my pre-tokenized files, the model does not recognize the word/subword boundaries properly (the "▁" is not treated as expected).
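A minimal sketch of what I mean, assuming camembert-base (the text and the exact subword split are just an illustration):

# Minimal repro of the mismatch
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("camembert-base")
text = "J'aime le camembert"

# Step 1 (my pre-tokenization): subword strings, with '▁' marking word starts
pieces = tok.tokenize(text)
print(pieces)  # exact split depends on the vocab, but word-initial pieces carry '▁'

# Step 2 (later reuse): feeding those strings back as text re-runs SentencePiece,
# so the literal '▁' characters and the inserted spaces shift the boundaries
ids_raw = tok(text, add_special_tokens=True)["input_ids"]
ids_reused = tok(" ".join(pieces), add_special_tokens=True)["input_ids"]
print(ids_raw == ids_reused)  # False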


In the case of a SentencePiece tokenizer, it seems the simplest method is to keep the raw text without pre-tokenizing it and tokenize it just before use…

# pip install -U transformers
# https://github.com/huggingface/transformers/issues/5087
# https://github.com/huggingface/transformers/issues/12308
# https://discuss.huggingface.co/t/added-tokens-not-decoding-with-spaces/10883
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("camembert-base")

# A no-space string that splits into multiple subpieces
text = "anticonstitutionnellement"

# Baseline: raw text → ids
ids_raw = tok(text, add_special_tokens=True)["input_ids"]

# Precompute subword strings (first piece starts with '▁', suffix pieces don't)
pieces = tok.tokenize(text)
assert any(not p.startswith("▁") for p in pieces), pieces  # ensure true subwording

# ✅ Correct reuse of pre-tokenized subword STRINGS: map directly to ids, then add specials
ids_right = tok.build_inputs_with_special_tokens(tok.convert_tokens_to_ids(pieces))

# ❌ Wrong 1: re-tokenize subword strings as if they were text (adds spaces, changes boundaries)
ids_wrong_text = tok(" ".join(pieces), add_special_tokens=True)["input_ids"]

# ❌ Wrong 2: treat subword strings as "words" (this flag expects words; it will re-tokenize)
ids_wrong_words = tok(pieces, is_split_into_words=True, add_special_tokens=True)["input_ids"]

print("pieces:", pieces) # pieces: ['▁anti', 'c', 'onstitutionnelle', 'ment']
print("raw == right      ?", ids_raw == ids_right) # raw == right      ? True
print("raw == wrong_text ?", ids_raw == ids_wrong_text) # raw == wrong_text ? False
print("raw == wrong_words?", ids_raw == ids_wrong_words) # raw == wrong_words? False
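If the files really do need to be prepared ahead of time, another option in the same spirit is to store the final input_ids (or simply the raw text) rather than the subword strings, since ids round-trip cleanly with no "▁" handling at all. A minimal sketch, where the tokenized.json file name and record layout are just an illustration:

import json
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("camembert-base")
text = "anticonstitutionnellement"

# Cache the final ids (and optionally the raw text), not the subword strings
record = {"text": text, "input_ids": tok(text, add_special_tokens=True)["input_ids"]}
with open("tokenized.json", "w") as f:
    json.dump([record], f)

# Training time: reload the ids as-is, no re-tokenization needed
with open("tokenized.json") as f:
    reloaded = json.load(f)[0]["input_ids"]
assert reloaded == tok(text, add_special_tokens=True)["input_ids"]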