Training tokenizers with padding in between tokens

akanakia · October 19, 2023, 6:43pm

Hi folks,

I am trying to train a set of tokenizers (BPE, WordPiece, and Unigram) on a dataset containing antibody sequences. The issue is that antibody sequences are usually pre-aligned using an alignment scheme such as Kabat or IMGT. This introduces padding tokens in the middle of the sequence, which never happens when tokenizing text. E.g., an aligned antibody sequence can look something like this: “QVQT–TYHHH ASTR-MTPY Q-----QY”, with “-” being the pad token introduced during sequence alignment.

I would like the tokenizers to essentially treat the pad token as an “unk” token during training, but there is already an unk token in the initial vocab, usually denoted by “X”, which represents an unknown amino acid in a sequence. Is there some way to enforce that learned tokens do not contain any padding characters, using the Hugging Face tokenizers library? Any help would be really appreciated. Thanks.
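
One possible direction, sketched here as an assumption rather than a confirmed answer: trainers in the tokenizers library learn from the pieces produced by the pre-tokenizer, so isolating every “-” into its own piece should keep merges from spanning it. The toy corpus, vocab size, and the choice to also register “-” as a special token are illustrative only. The same pre-tokenizer could in principle be paired with the WordPiece and Unigram trainers, though only BPE is shown.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Sequence, Split, WhitespaceSplit
from tokenizers.trainers import BpeTrainer

# "X" is the unknown-residue token from the post.
tokenizer = Tokenizer(BPE(unk_token="X"))

# Split on whitespace first, then cut out every "-" as its own piece.
# BPE merges are only learned inside a piece, so no merge can span a "-".
tokenizer.pre_tokenizer = Sequence([
    WhitespaceSplit(),
    Split("-", behavior="isolated"),
])

# Illustrative settings (assumptions): tiny vocab, "-" kept as a special token.
trainer = BpeTrainer(
    vocab_size=100,
    special_tokens=["X", "-"],
    show_progress=False,
)

# Toy corpus for illustration; replace with the real aligned sequences.
corpus = ["QVQT--TYHHH ASTR-MTPY Q-----QY"] * 20
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Check that no learned token other than "-" itself contains the pad character.
vocab = tokenizer.get_vocab()
leaked = [tok for tok in vocab if "-" in tok and tok != "-"]
print("tokens containing '-':", leaked)

tokens = tokenizer.encode("QVQT--TYHHH").tokens
print(tokens)
```

If `leaked` comes back empty on the real data, the learned vocabulary is free of padding characters, and each “-” is emitted as a standalone token at encoding time.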
