I’m hoping to train a GPT-2 model from scratch, where the sentences are protein chains and the words are single-ASCII-character amino-acid codes, e.g. “A” for alanine and “N” for asparagine. There are no spaces or other separators between words.
Due to constraints in other parts of my code, I would strongly prefer single ASCII characters for my special tokens as well. I suspect this requirement is the root of my problem: Python hangs and then crashes without an error message when I run this minimal tokenizer. Did I pick a character that is reserved but not documented as a special token?
Minimal reproducible code:
import numpy as np
import torch
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

tokenizer = Tokenizer(Unigram())
tokenizer.pre_tokenizer = Whitespace()
tokenizer.add_tokens(['I', 'L', 'V', 'F', 'M', 'C', 'A', 'G', 'P', 'T', 'S', 'Y', 'W',
                      'Q', 'N', 'H', 'E', 'D', 'K', 'R', 'J', 'U', 'O'])
tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer,
                                    bos_token='>', eos_token='=',
                                    unk_token='X', pad_token='_')

sequences = ['>RNLYYYGRPDYW=>FGGSENATNLFLLELLGAGE=',
             '>RNLYYYGRPDYW=>TLPLSLPTSAQDSNFSVKTE=',
             '>CTGGSSWYVPDYW=>PNT=']

tokenizer(sequences, return_tensors="pt", padding='longest')  # Python hangs and crashes here
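For context, here is a sketch of the character-level behaviour I’m after, written with a WordLevel model and an explicit vocabulary instead of an empty Unigram (the Split pre-tokenizer, the vocabulary layout, and the variable names here are my own choices, not part of the failing setup above). My understanding is that Unigram normally carries trained (token, score) pairs, so an empty model may be involved in the crash, but I’d still like to understand why the version above aborts rather than raising an error.

```python
from tokenizers import Regex, Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Split
from transformers import PreTrainedTokenizerFast

# Single-character vocabulary: the four special tokens first,
# then the 23 amino-acid letters from the repro above.
specials = ['>', '=', 'X', '_']
amino_acids = list('ILVFMCAGPTSYWQNHEDKRJUO')
vocab = {tok: i for i, tok in enumerate(specials + amino_acids)}

# WordLevel is a plain per-"word" vocabulary lookup; splitting on every
# character (Regex "." with behavior="isolated") makes each character a word.
char_tokenizer = Tokenizer(WordLevel(vocab, unk_token='X'))
char_tokenizer.pre_tokenizer = Split(Regex('.'), behavior='isolated')

wrapped = PreTrainedTokenizerFast(tokenizer_object=char_tokenizer,
                                  bos_token='>', eos_token='=',
                                  unk_token='X', pad_token='_')

sequences = ['>RNLYYYGRPDYW=>FGGSENATNLFLLELLGAGE=',
             '>RNLYYYGRPDYW=>TLPLSLPTSAQDSNFSVKTE=',
             '>CTGGSSWYVPDYW=>PNT=']
batch = wrapped(sequences, padding='longest')  # lists of ids, padded with '_'
```

I dropped `return_tensors="pt"` here only to keep the sketch torch-free; adding it back should simply wrap the same padded ids in tensors.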