Doesn't gpt-oss use tiktoken?

This page says that gpt2 and llama3 are the known models that use tiktoken.

But the gpt-oss model card (section 2.3) says gpt-oss also uses tiktoken, and the models are released on the Hugging Face Hub.

Is the page incorrect, or does some conversion mean gpt-oss no longer uses tiktoken?


I tested loading it, and it seems safe to assume that tiktoken is being used.


gpt-oss uses tiktoken — the HF page is about a specific Hub file format, not an exhaustive list of “tiktoken-tokenized models”

What OpenAI means (training-time truth)

OpenAI states that gpt-oss is trained using the o200k_harmony tokenizer, open-sourced in tiktoken, and that it has 201,088 tokens. (cdn.openai.com)
That “text → token IDs” mapping is part of the model itself (embedding rows correspond to token IDs), so it doesn’t “stop using tiktoken” just because it’s hosted on Hugging Face.
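A toy sketch of why the mapping travels with the model (using a made-up three-token vocabulary, not the real o200k_harmony table): the embedding table is indexed by token ID, so the tokenizer's text → ID mapping and the weight matrix's row order must agree, no matter which library or file format produced the IDs.

```python
# Toy illustration (hypothetical vocab, NOT the real gpt-oss tokenizer):
# the text -> ID mapping must match the embedding table's row order,
# so a model can't "stop using" its tokenizer without scrambling
# which vector each token looks up.
toy_vocab = {"hello": 0, " world": 1, "<|end|>": 2}
toy_embeddings = [
    [0.1, 0.2],  # row 0 <-> "hello"
    [0.3, 0.4],  # row 1 <-> " world"
    [0.9, 0.9],  # row 2 <-> "<|end|>"
]

def embed(tokens):
    """Look up the embedding row for each token via its ID."""
    ids = [toy_vocab[t] for t in tokens]
    return [toy_embeddings[i] for i in ids]

print(embed(["hello", " world"]))  # [[0.1, 0.2], [0.3, 0.4]]
```

Whether the ID for `"hello"` came from tiktoken or from a converted `tokenizer.json` is irrelevant, as long as both produce ID 0.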


Why the Hugging Face Transformers doc lists only GPT-2 and Llama-3

The Transformers page you linked is narrowly describing this feature:

Transformers can load a model repo that contains a tokenizer.model file in tiktoken model-file format, and auto-convert it into a fast tokenizer.

It then lists “Known models that were released with a tiktoken.model: gpt2, llama3”. (Hugging Face)

Crucially, that same page also says:

  • tokenizer.model contains no info about additional tokens or pattern strings
  • if those matter, you should convert to tokenizer.json (the proper format for PreTrainedTokenizerFast). (Hugging Face)

So the “known models” list is best read as: known repos that ship a tokenizer.model tiktoken file, not “the only models whose tokenizer corresponds to tiktoken.”
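To see why `tokenizer.model` can't carry special tokens or pattern strings: the tiktoken model-file format is just one `<base64-encoded token bytes> <rank>` pair per line and nothing else. A minimal parser sketch (the three-line payload below is made up for illustration):

```python
import base64

# Hypothetical tokenizer.model payload: each line is
# "<base64 token bytes> <rank>". That's the whole file -- there is
# nowhere to record special tokens or the pre-tokenization regex,
# which is why HF recommends tokenizer.json when those matter.
model_file = b"""aGVsbG8= 0
IHdvcmxk 1
IQ== 2
"""

def load_tiktoken_bpe(data: bytes) -> dict:
    """Parse tiktoken model-file lines into a {token_bytes: rank} map."""
    ranks = {}
    for line in data.splitlines():
        if not line:
            continue
        token_b64, rank = line.split()
        ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks

print(load_tiktoken_bpe(model_file))
# {b'hello': 0, b' world': 1, b'!': 2}
```

Everything beyond this rank table (Harmony's many special tokens, the split pattern) has to live somewhere else, which is exactly what `tokenizer.json` provides.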


What gpt-oss ships on Hugging Face (and why)

The Hugging Face openai/gpt-oss-120b repo ships tokenizer.json (and related config files) and does not ship tokenizer.model. (Hugging Face)

Why that makes sense for gpt-oss:

  • gpt-oss relies on the Harmony message format with many special tokens. The Harmony docs explicitly say that if you use tiktoken, those special tokens are encoded in o200k_harmony. (OpenAI Developers)
  • Hugging Face’s doc warns that tokenizer.model can’t represent some of the metadata that matters for such tokenizers, and recommends tokenizer.json in that situation. (Hugging Face)

This is not a conversion away from tiktoken. It’s a conversion/serialization of the same mapping into a HF-friendly artifact (tokenizer.json).


Direct answer to your question

  • gpt-oss does use tiktoken (its canonical tokenizer is o200k_harmony). (cdn.openai.com)
  • The Hugging Face page is not saying “only GPT-2 and Llama-3 use tiktoken in general”; it’s listing models known to ship a tokenizer.model tiktoken file. gpt-oss ships tokenizer.json instead. (Hugging Face)
  • So it’s not that “conversion makes gpt-oss not use tiktoken anymore.” It’s mainly packaging + format. (Hugging Face)
"""
Sanity-check: Hugging Face tokenizer.json vs tiktoken o200k_harmony for gpt-oss (token IDs should match).

URLs:
- HF model repo: https://huggingface.co/openai/gpt-oss-120b
- HF docs (tiktoken integration): https://huggingface.co/docs/transformers/en/tiktoken
- OpenAI gpt-oss model card (tokenizer section): https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf
- OpenAI Harmony format (special tokens + IDs): https://developers.openai.com/cookbook/articles/openai-harmony/
- tiktoken repo: https://github.com/openai/tiktoken

Deps (low RAM/VRAM; no model weights loaded; CPU/GPU safe; T4 safe):
  pip install -U transformers tiktoken

Notes:
- This compares tokenization only (text -> token IDs). It does NOT download/load 120B weights.
- In notebooks, DO NOT call sys.exit()/raise SystemExit(); just print results and return codes.
- HF warning about HF_TOKEN in Colab is informational; public models still work without auth.
- `tokenizer.vocab_size` typically excludes "added tokens"; `len(tokenizer)` includes them. (Explains 199998 vs larger totals.)
"""

from __future__ import annotations

from typing import List, Dict


MODEL_ID = "openai/gpt-oss-120b"
TIKTOKEN_ENCODING = "o200k_harmony"


def first_diff(a: List[int], b: List[int]) -> int:
    """Return first index where lists differ, or -1 if identical."""
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            return i
    return -1 if len(a) == len(b) else n


def short_list(xs: List[int], limit: int = 40) -> str:
    return str(xs) if len(xs) <= limit else f"{xs[:limit]} ... (len={len(xs)})"


def get_added_vocab_size(tok) -> int:
    """Best-effort: number of 'added' tokens (special tokens often live here)."""
    # Fast tokenizers typically implement get_added_vocab(); fall back to internal attrs if present.
    if hasattr(tok, "get_added_vocab"):
        try:
            return len(tok.get_added_vocab())
        except Exception:
            pass
    if hasattr(tok, "added_tokens_encoder"):
        try:
            return len(tok.added_tokens_encoder)
        except Exception:
            pass
    return -1  # unknown


def main() -> int:
    # Imports inside main for clearer error messages in notebooks.
    try:
        import tiktoken
    except Exception as e:
        print("ERROR: tiktoken import failed. Install: pip install -U tiktoken")
        print(repr(e))
        return 1

    try:
        from transformers import AutoTokenizer
    except Exception as e:
        print("ERROR: transformers import failed. Install: pip install -U transformers")
        print(repr(e))
        return 1

    print(f"Model: {MODEL_ID}")
    print(f"tiktoken encoding: {TIKTOKEN_ENCODING}")
    print()

    # 1) Load HF tokenizer artifacts (tokenizer.json/config only; no model weights).
    tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
    print(f"HF tokenizer class: {tok.__class__.__name__}")
    print(f"HF is_fast: {getattr(tok, 'is_fast', '(unknown)')}")
    print()

    # IMPORTANT: vocab_size vs len(tokenizer) differs by design.
    hf_vocab_size = getattr(tok, "vocab_size", None)
    hf_len = len(tok)
    hf_added = get_added_vocab_size(tok)
    hf_get_vocab_size = None
    try:
        hf_get_vocab_size = len(tok.get_vocab())
    except Exception:
        pass

    print("== HF vocab accounting ==")
    print(f"tokenizer.vocab_size (base): {hf_vocab_size}")
    print(f"len(tokenizer) (base + added): {hf_len}")
    print(f"added vocab size (best-effort): {hf_added}")
    if hf_get_vocab_size is not None:
        print(f"len(tokenizer.get_vocab()): {hf_get_vocab_size}")
    print()

    # 2) Load tiktoken encoding (canonical for gpt-oss per model card).
    try:
        enc = tiktoken.get_encoding(TIKTOKEN_ENCODING)
    except Exception as e:
        print(f"ERROR: tiktoken.get_encoding('{TIKTOKEN_ENCODING}') failed.")
        print("Try upgrading tiktoken: pip install -U tiktoken")
        print(repr(e))
        return 1

    tt_n_vocab = getattr(enc, "n_vocab", None)
    print("== tiktoken vocab ==")
    print(f"enc.n_vocab: {tt_n_vocab}")
    if tt_n_vocab is not None and hf_len is not None:
        print(f"gap (tiktoken - len(HF tokenizer)): {tt_n_vocab - hf_len}")
    print()

    # 3) Check Harmony special tokens and expected IDs from Harmony spec.
    harmony_specials: Dict[str, int] = {
        "<|return|>": 200002,
        "<|constrain|>": 200003,
        "<|channel|>": 200005,
        "<|start|>": 200006,
        "<|end|>": 200007,
        "<|message|>": 200008,
        "<|call|>": 200012,
    }

    print("== Special token ID checks (HF vs expected vs tiktoken) ==")
    specials_ok = True
    for s, expected_id in harmony_specials.items():
        hf_id = tok.convert_tokens_to_ids(s)

        # allowed_special="all" ensures "<|...|>" maps to special token IDs (not literal text bytes).
        tt_ids = enc.encode(s, allowed_special="all")
        tt_id = tt_ids[0] if len(tt_ids) == 1 else None

        ok = (hf_id == expected_id) and (tt_id == expected_id) and (len(tt_ids) == 1)
        specials_ok = specials_ok and ok
        print(f"{s:12s}  expected={expected_id}  hf={hf_id}  tiktoken={tt_ids}  ok={ok}")
    print()

    # 4) End-to-end text -> IDs parity checks on tricky strings.
    cases: List[str] = [
        "hello world",
        "こんにちは世界",
        " leading space",
        "trailing space ",
        "\nnew\nlines\n",
        "<|return|>",
        "<|call|>",
        "<|start|>user<|message|>Hi<|end|>\n<|start|>assistant",
        "<|channel|>analysis<|message|>test<|end|>",
    ]

    print("== Text->IDs parity checks (HF vs tiktoken) ==")
    parity_ok = True
    for text in cases:
        hf_ids = tok(text, add_special_tokens=False, return_attention_mask=False)["input_ids"]
        tt_ids = enc.encode(text, allowed_special="all")

        ok = hf_ids == tt_ids
        parity_ok = parity_ok and ok
        print(f"[ok={ok}] text={text!r}")
        if not ok:
            i = first_diff(hf_ids, tt_ids)
            print(f"  HF:      {short_list(hf_ids)}")
            print(f"  tiktoken:{short_list(tt_ids)}")
            print(f"  first diff index: {i}")
            # Local neighborhood debug (best-effort; token-level)
            if i >= 0:
                lo, hi = max(0, i - 3), i + 4
                print("  neighborhood:")
                try:
                    print("   HF tokens:     ", tok.convert_ids_to_tokens(hf_ids[lo:hi]))
                except Exception:
                    pass
                try:
                    print("   tiktoken text: ", [enc.decode([t]) for t in tt_ids[lo:hi]])
                except Exception:
                    pass
        print()

    overall_ok = specials_ok and parity_ok
    print("OVERALL:", "PASS" if overall_ok else "FAIL (see above)")
    # Return code (no sys.exit / no raise SystemExit) so notebooks don't show it as an exception.
    return 0 if overall_ok else 2


# Run immediately in a notebook cell:
rc = main()
print(f"Return code: {rc}  (0=pass)")

Thanks for the detailed answer.

It answered my question 🙂

