I’m hoping to train a GPT-2 model from scratch, where the sentences are protein chains and the words are single-ASCII-character amino-acid codes, e.g. “A” for alanine and “N” for asparagine. There are no spaces or other separators between words.
Due to constraints in other parts of my code, I would strongly prefer single ASCII characters for my special tokens as well. I suspect this requirement is the root of my problem: Python hangs and then crashes without an error message when I run this minimal tokenizer. Did I pick a character that is reserved but not documented as a special token?
Minimal reproducible code:
import numpy as np
import torch
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

tokenizer = Tokenizer(Unigram())
tokenizer.pre_tokenizer = Whitespace()
tokenizer.add_tokens(['I', 'L', 'V', 'F', 'M', 'C', 'A', 'G', 'P', 'T', 'S', 'Y', 'W',
                      'Q', 'N', 'H', 'E', 'D', 'K', 'R', 'J', 'U', 'O'])
tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer,
                                    bos_token='>', eos_token='=',
                                    unk_token='X', pad_token='_')

sequences = ['>RNLYYYGRPDYW=>FGGSENATNLFLLELLGAGE=',
             '>RNLYYYGRPDYW=>TLPLSLPTSAQDSNFSVKTE=',
             '>CTGGSSWYVPDYW=>PNT=']

tokenizer(sequences, return_tensors="pt", padding='longest')  # Python hangs and crashes here
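For context, here is a sketch of the character-level behaviour I’m after, written with a WordLevel model and an explicit vocabulary instead of an empty Unigram (the Split pre-tokenizer, the vocabulary layout, and the variable names here are my own choices, not part of the failing setup above). My understanding is that Unigram normally carries trained (token, score) pairs, so an empty model may be involved in the crash, but I’d still like to understand why the version above aborts rather than raising an error.

```python
from tokenizers import Regex, Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Split
from transformers import PreTrainedTokenizerFast

# Single-character vocabulary: the four special tokens first,
# then the 23 amino-acid letters from the repro above.
specials = ['>', '=', 'X', '_']
amino_acids = list('ILVFMCAGPTSYWQNHEDKRJUO')
vocab = {tok: i for i, tok in enumerate(specials + amino_acids)}

# WordLevel is a plain per-"word" vocabulary lookup; splitting on every
# character (Regex "." with behavior="isolated") makes each character a word.
char_tokenizer = Tokenizer(WordLevel(vocab, unk_token='X'))
char_tokenizer.pre_tokenizer = Split(Regex('.'), behavior='isolated')

wrapped = PreTrainedTokenizerFast(tokenizer_object=char_tokenizer,
                                  bos_token='>', eos_token='=',
                                  unk_token='X', pad_token='_')

sequences = ['>RNLYYYGRPDYW=>FGGSENATNLFLLELLGAGE=',
             '>RNLYYYGRPDYW=>TLPLSLPTSAQDSNFSVKTE=',
             '>CTGGSSWYVPDYW=>PNT=']
batch = wrapped(sequences, padding='longest')  # lists of ids, padded with '_'
```

I dropped `return_tensors="pt"` here only to keep the sketch torch-free; adding it back should simply wrap the same padded ids in tensors.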