I am trying to train a set of tokenizers (BPE, WordPiece, and Unigram) on a dataset of antibody sequences. The issue is that antibody sequences are usually pre-aligned using a numbering scheme such as Kabat or IMGT. This introduces padding tokens in the middle of the sequence, which never happens when tokenizing text. For example, an aligned antibody sequence can look something like “QVQT--TYHHH ASTR-MTPY Q-----QY”, with “-” being the pad token introduced during sequence alignment.
I would like the tokenizers to essentially treat the pad token as an “unk” token during pretraining, but there is already an unk token in the initial vocab, usually denoted by “X”, which represents an unknown amino acid in a sequence. Is there some way to enforce that learned tokens do not contain any padding characters using the Hugging Face tokenizers library? Any help would be really appreciated. Thanks.
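
To make the goal concrete, here is a minimal sketch of the behaviour I am after, for BPE only. It assumes that a `Split` pre-tokenizer with `behavior="isolated"` is an acceptable way to put every “-” into its own piece, so that merges can never span a pad character. The toy sequences, the vocabulary size, and the choice to register “X” as a special token are made up for illustration. Whether the same setup is right for WordPiece and Unigram is still an open question.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Sequence, Split, WhitespaceSplit
from tokenizers.trainers import BpeTrainer

PAD = "-"  # gap character introduced by sequence alignment
UNK = "X"  # unknown amino acid, reused here as the unk token

# Toy aligned sequences, invented for illustration only.
sequences = [
    "QVQT--TYHHH ASTR-MTPY Q-----QY",
    "QVQS--TYHHH ASTK-MTPY Q----AQY",
    "EVQT-ATYHHH ASTR-MSPY Q-----QY",
] * 20

tokenizer = Tokenizer(BPE(unk_token=UNK))

# Split on whitespace first, then isolate every pad character as its own
# piece. BPE only merges within a piece, so no merge can include "-".
tokenizer.pre_tokenizer = Sequence([
    WhitespaceSplit(),
    Split(pattern=PAD, behavior="isolated"),
])

# Assumption: a small vocab_size is enough for this toy corpus.
trainer = BpeTrainer(vocab_size=60, special_tokens=[UNK], show_progress=False)
tokenizer.train_from_iterator(sequences, trainer=trainer)

# Collect every vocabulary entry that contains the pad character.
vocab = tokenizer.get_vocab()
pad_tokens = [tok for tok in vocab if PAD in tok]
encoded = tokenizer.encode("QVQT--TYHHH")

print(pad_tokens)
print(encoded.tokens)
```

With this setup the only vocabulary entry containing “-” should be the single character itself, and a run such as “-----” is encoded as repeated “-” tokens, not as one learned multi-character token.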