Python crashes without an error message when I try to use this custom tokenizer

I’m hoping to train a GPT-2 model from scratch, where the sentences are protein chains and the words are single-ASCII-character representations of amino acids, e.g. “A” for alanine and “N” for asparagine. There are no spaces or other separators between words.

Due to constraints in other parts of my code, I would strongly prefer to have single ASCII characters for my special tokens as well. I suspect this requirement is the root of my problem - Python hangs and then crashes without an error message when I try to use this minimal tokenizer. Maybe I used a forbidden character that’s not documented as a special token?
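To make the intended scheme concrete, here is a library-free sketch of a character-level vocabulary in which residues and special tokens are all single ASCII characters. The residue alphabet is the one from the reproduction code, and the four special characters are the ones that appear in the fix at the end of this post. The ID assignment (simple list order) and the names RESIDUES, SPECIALS, VOCAB, encode and decode are assumptions made for illustration; this is not how the tokenizers library stores or numbers its tokens.

```python
# Library-free sketch of a single-character vocabulary.
# Alphabet and special characters are copied from the post; the ID order
# and all helper names below are illustrative assumptions.
RESIDUES = ['I', 'L', 'V', 'F', 'M', 'C', 'A', 'G', 'P', 'T', 'S', 'Y',
            'W', 'Q', 'N', 'H', 'E', 'D', 'K', 'R', 'J', 'U', 'O']
SPECIALS = ['>', '=', 'X', '_']

# Residues get the first IDs, special characters follow.
VOCAB = {ch: i for i, ch in enumerate(RESIDUES + SPECIALS)}
ID_TO_CHAR = {i: ch for ch, i in VOCAB.items()}


def encode(chain):
    """Map each character of a chain to its ID; reject unknown characters."""
    unknown = sorted(set(chain) - set(VOCAB))
    if unknown:
        raise ValueError(f"characters not in vocabulary: {unknown}")
    return [VOCAB[ch] for ch in chain]


def decode(ids):
    """Inverse of encode: join the characters back without separators."""
    return "".join(ID_TO_CHAR[i] for i in ids)
```

Because every token is exactly one character, no pre-tokenization or separator handling is needed in this sketch. Failing loudly on characters outside the vocabulary makes a stray symbol show up as an exception instead of a silent mis-tokenization.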

Minimal reproducible code:

import numpy as np
import torch
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

tokenizer = Tokenizer(Unigram())
tokenizer.pre_tokenizer = Whitespace()
tokenizer.add_tokens(['I', 'L', 'V', 'F', 'M', 'C', 'A', 'G', 'P', 'T', 'S', 'Y', 'W', 'Q', 'N', 'H', 'E', 'D', 'K', 'R', 'J', 'U', 'O'])

tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer,


# Python hangs and crashes here

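It’s on me; the special tokens had never been added to the tokenizer. The issue was solved with a single line of code, called on the tokenizers Tokenizer object before it is wrapped in PreTrainedTokenizerFast: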

tokenizer.add_special_tokens(['>', '=', 'X', '_'])