I tested loading it, and it seems safe to conclude that tiktoken is being used.
gpt-oss uses tiktoken; the HF page is about a specific Hub file format, not an exhaustive list of "tiktoken-tokenized models".
What OpenAI means (training-time truth)
OpenAI states that gpt-oss is trained using the o200k_harmony tokenizer, open-sourced in tiktoken, and that it has 201,088 tokens. (cdn.openai.com)
That "text -> token IDs" mapping is part of the model itself (embedding rows correspond to token IDs), so it doesn't "stop using tiktoken" just because it's hosted on Hugging Face.
Why the Hugging Face Transformers doc lists only GPT-2 and Llama-3
The Transformers page you linked is narrowly describing this feature:
Transformers can load a model repo that contains a tokenizer.model file in tiktoken model-file format, and auto-convert it into a fast tokenizer.
It then lists "Known models that were released with a tiktoken.model: gpt2, llama3". (Hugging Face)
Crucially, that same page also says:
tokenizer.model contains no info about additional tokens or pattern strings - if those matter, you should convert to tokenizer.json (the proper format for PreTrainedTokenizerFast). (Hugging Face)
So the "known models" list is best read as: known repos that ship a tokenizer.model tiktoken file, not "the only models whose tokenizer corresponds to tiktoken."
What gpt-oss ships on Hugging Face (and why)
The Hugging Face openai/gpt-oss-120b repo ships tokenizer.json (and related config files) and does not ship tokenizer.model. (Hugging Face)
Why that makes sense for gpt-oss:
- gpt-oss relies on the Harmony message format with many special tokens. The Harmony docs explicitly say that if you use tiktoken, those special tokens are encoded in o200k_harmony. (OpenAI Developers)
- Hugging Face's doc warns that tokenizer.model can't represent some of the metadata that matters for such tokenizers, and recommends tokenizer.json in that situation. (Hugging Face)
This is not a conversion away from tiktoken. It's a serialization of the same mapping into an HF-friendly artifact (tokenizer.json).
Direct answer to your question
- gpt-oss does use tiktoken (its canonical tokenizer is o200k_harmony). (cdn.openai.com)
- The Hugging Face page is not saying "only GPT-2 and Llama-3 use tiktoken in general"; it's listing models known to ship a tokenizer.model tiktoken file. gpt-oss ships tokenizer.json instead. (Hugging Face)
- So it's not that "conversion makes gpt-oss not use tiktoken anymore." It's mainly packaging + format. (Hugging Face)
"""
Sanity-check: Hugging Face tokenizer.json vs tiktoken o200k_harmony for gpt-oss (token IDs should match).
URLs:
- HF model repo: https://huggingface.co/openai/gpt-oss-120b
- HF docs (tiktoken integration): https://huggingface.co/docs/transformers/en/tiktoken
- OpenAI gpt-oss model card (tokenizer section): https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf
- OpenAI Harmony format (special tokens + IDs): https://developers.openai.com/cookbook/articles/openai-harmony/
- tiktoken repo: https://github.com/openai/tiktoken
Deps (low RAM/VRAM; no model weights loaded; CPU/GPU safe; T4 safe):
pip install -U transformers tiktoken
Notes:
- This compares tokenization only (text -> token IDs). It does NOT download/load 120B weights.
- In notebooks, DO NOT call sys.exit()/raise SystemExit(); just print results and return codes.
- HF warning about HF_TOKEN in Colab is informational; public models still work without auth.
- `tokenizer.vocab_size` typically excludes "added tokens"; `len(tokenizer)` includes them. (Explains 199998 vs larger totals.)
"""
from __future__ import annotations
from typing import List, Dict, Tuple
MODEL_ID = "openai/gpt-oss-120b"
TIKTOKEN_ENCODING = "o200k_harmony"
def first_diff(a: List[int], b: List[int]) -> int:
    """Return first index where lists differ, or -1 if identical."""
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            return i
    return -1 if len(a) == len(b) else n
def short_list(xs: List[int], limit: int = 40) -> str:
    """Render a list compactly, truncating long lists."""
    return str(xs) if len(xs) <= limit else f"{xs[:limit]} ... (len={len(xs)})"
def get_added_vocab_size(tok) -> int:
    """Best-effort: number of 'added' tokens (special tokens often live here)."""
    # Fast tokenizers typically implement get_added_vocab(); fall back to
    # internal attributes if present.
    if hasattr(tok, "get_added_vocab"):
        try:
            return len(tok.get_added_vocab())
        except Exception:
            pass
    if hasattr(tok, "added_tokens_encoder"):
        try:
            return len(tok.added_tokens_encoder)
        except Exception:
            pass
    return -1  # unknown
def main() -> int:
    # Imports inside main for clearer error messages in notebooks.
    try:
        import tiktoken
    except Exception as e:
        print("ERROR: tiktoken import failed. Install: pip install -U tiktoken")
        print(repr(e))
        return 1
    try:
        from transformers import AutoTokenizer
    except Exception as e:
        print("ERROR: transformers import failed. Install: pip install -U transformers")
        print(repr(e))
        return 1

    print(f"Model: {MODEL_ID}")
    print(f"tiktoken encoding: {TIKTOKEN_ENCODING}")
    print()

    # 1) Load HF tokenizer artifacts (tokenizer.json/config only; no model weights).
    tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
    print(f"HF tokenizer class: {tok.__class__.__name__}")
    print(f"HF is_fast: {getattr(tok, 'is_fast', '(unknown)')}")
    print()

    # IMPORTANT: vocab_size vs len(tokenizer) differs by design.
    hf_vocab_size = getattr(tok, "vocab_size", None)
    hf_len = len(tok)
    hf_added = get_added_vocab_size(tok)
    hf_get_vocab_size = None
    try:
        hf_get_vocab_size = len(tok.get_vocab())
    except Exception:
        pass

    print("== HF vocab accounting ==")
    print(f"tokenizer.vocab_size (base): {hf_vocab_size}")
    print(f"len(tokenizer) (base + added): {hf_len}")
    print(f"added vocab size (best-effort): {hf_added}")
    if hf_get_vocab_size is not None:
        print(f"len(tokenizer.get_vocab()): {hf_get_vocab_size}")
    print()

    # 2) Load tiktoken encoding (canonical for gpt-oss per model card).
    try:
        enc = tiktoken.get_encoding(TIKTOKEN_ENCODING)
    except Exception as e:
        print(f"ERROR: tiktoken.get_encoding('{TIKTOKEN_ENCODING}') failed.")
        print("Try upgrading tiktoken: pip install -U tiktoken")
        print(repr(e))
        return 1

    tt_n_vocab = getattr(enc, "n_vocab", None)
    print("== tiktoken vocab ==")
    print(f"enc.n_vocab: {tt_n_vocab}")
    if tt_n_vocab is not None:
        print(f"gap (tiktoken - len(HF tokenizer)): {tt_n_vocab - hf_len}")
    print()

    # 3) Check Harmony special tokens against the expected IDs from the Harmony spec.
    harmony_specials: Dict[str, int] = {
        "<|return|>": 200002,
        "<|constrain|>": 200003,
        "<|channel|>": 200005,
        "<|start|>": 200006,
        "<|end|>": 200007,
        "<|message|>": 200008,
        "<|call|>": 200012,
    }
    print("== Special token ID checks (HF vs expected vs tiktoken) ==")
    specials_ok = True
    for s, expected_id in harmony_specials.items():
        hf_id = tok.convert_tokens_to_ids(s)
        # allowed_special="all" ensures "<|...|>" maps to special token IDs
        # (not literal text bytes).
        tt_ids = enc.encode(s, allowed_special="all")
        tt_id = tt_ids[0] if len(tt_ids) == 1 else None
        ok = (hf_id == expected_id) and (tt_id == expected_id) and (len(tt_ids) == 1)
        specials_ok = specials_ok and ok
        print(f"{s:12s} expected={expected_id} hf={hf_id} tiktoken={tt_ids} ok={ok}")
    print()

    # 4) End-to-end text -> IDs parity checks on tricky strings.
    cases: List[str] = [
        "hello world",
        "こんにちは世界",
        " leading space",
        "trailing space ",
        "\nnew\nlines\n",
        "<|return|>",
        "<|call|>",
        "<|start|>user<|message|>Hi<|end|>\n<|start|>assistant",
        "<|channel|>analysis<|message|>test<|end|>",
    ]
    print("== Text->IDs parity checks (HF vs tiktoken) ==")
    parity_ok = True
    for text in cases:
        hf_ids = tok(text, add_special_tokens=False, return_attention_mask=False)["input_ids"]
        tt_ids = enc.encode(text, allowed_special="all")
        ok = hf_ids == tt_ids
        parity_ok = parity_ok and ok
        print(f"[ok={ok}] text={text!r}")
        if not ok:
            i = first_diff(hf_ids, tt_ids)
            print(f"  HF:       {short_list(hf_ids)}")
            print(f"  tiktoken: {short_list(tt_ids)}")
            print(f"  first diff index: {i}")
            # Local neighborhood debug (best-effort; token-level).
            if i >= 0:
                lo, hi = max(0, i - 3), i + 4
                print("  neighborhood:")
                try:
                    print("    HF tokens:     ", tok.convert_ids_to_tokens(hf_ids[lo:hi]))
                except Exception:
                    pass
                try:
                    print("    tiktoken text: ", [enc.decode([t]) for t in tt_ids[lo:hi]])
                except Exception:
                    pass
    print()

    overall_ok = specials_ok and parity_ok
    print("OVERALL:", "PASS" if overall_ok else "FAIL (see above)")
    # Return code (no sys.exit / no raise SystemExit) so notebooks don't show it as an exception.
    return 0 if overall_ok else 2
# Run immediately in a notebook cell:
rc = main()
print(f"Return code: {rc} (0=pass)")