I’m using the ProtBert models from ProtTrans (https://github.com/agemagician/ProtTrans) for protein embeddings. When I use the BertTokenizer (and the same thing happens with the T5Tokenizer), every one of my sequences gets tokenized to [2, 1, 3]. Does anyone know what’s going wrong?
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
sequences = ['MEGGGKPNSASNSRDDGNSVYPSKAPATGPAAAD', 'MQLKAKEELLRNMELGLIPDQEIRQLIRVE', 'MTVSTSKTPKKNIKYTLTHTLQKWKETLKKITHETLSSI']
tokenizer(sequences)
gives a result like:
{'input_ids': [[2, 1, 3], [2, 1, 3], [2, 1, 3]], 'token_type_ids': [[0, 0, 0], [0, 0, 0], [0, 0, 0]], 'attention_mask': [[1, 1, 1], [1, 1, 1], [1, 1, 1]]}
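For what it’s worth, here is the sanity check I tried (assuming I’m interpreting the IDs correctly): mapping the IDs back to tokens and printing the tokenizer’s special-token IDs. If 2/1/3 turn out to be [CLS]/[UNK]/[SEP], then each whole sequence is apparently collapsing into a single unknown token.

# Map the returned IDs back to their tokens
print(tokenizer.convert_ids_to_tokens([2, 1, 3]))
# Compare against the tokenizer's own special-token IDs
print(tokenizer.cls_token_id, tokenizer.unk_token_id, tokenizer.sep_token_id)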