All my sequences get tokenized the same

I’m using the ProtBert models from ProtTrans (https://github.com/agemagician/ProtTrans) for protein embeddings. When I use the BertTokenizer (and likewise the T5Tokenizer), every sequence gets tokenized to [2, 1, 3]. Does anyone know what’s going wrong?

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
sequences = ['MEGGGKPNSASNSRDDGNSVYPSKAPATGPAAAD', 'MQLKAKEELLRNMELGLIPDQEIRQLIRVE', 'MTVSTSKTPKKNIKYTLTHTLQKWKETLKKITHETLSSI']

tokenizer(sequences)

gives a result like

{'input_ids': [[2, 1, 3], [2, 1, 3], [2, 1, 3]], 'token_type_ids': [[0, 0, 0], [0, 0, 0], [0, 0, 0]], 'attention_mask': [[1, 1, 1], [1, 1, 1], [1, 1, 1]]}

Hi there!

You need to put spaces between the amino-acid characters. [2, 1, 3] is [CLS] [UNK] [SEP]: without spaces, the whole sequence is treated as a single token that isn’t in the vocabulary, so it maps to [UNK].
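You can confirm this by mapping the ids back to tokens with the same tokenizer (a quick check using convert_ids_to_tokens):

>>> tokenizer.convert_ids_to_tokens([2, 1, 3])
['[CLS]', '[UNK]', '[SEP]']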

With spaces, it looks better :slightly_smiling_face:

>>> tokenizer(' '.join(list("MEGGGKPNSASNSRDDGNSVYPSKAPATGPAAAD")))

{'input_ids': [2, 21, 9, 7, 7, 7, 12, 16, 17, 10, 6, 10, 17, 10, 13, 14, 14, 7, 17, 10, 8, 20, 16, 10, 12, 6, 16, 6, 15, 7, 16, 6, 6, 6, 14, 3],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
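For a whole batch, you can add the spaces in a list comprehension before calling the tokenizer. The sketch below is one way to do it, not the only one: it also maps rare amino acids (U, Z, O, B) to X, as the ProtTrans examples do, and uses padding=True so the batch is rectangular (adjust to your needs).

import re
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)

sequences = ['MEGGGKPNSASNSRDDGNSVYPSKAPATGPAAAD',
             'MQLKAKEELLRNMELGLIPDQEIRQLIRVE',
             'MTVSTSKTPKKNIKYTLTHTLQKWKETLKKITHETLSSI']

# Space-separate every residue so each amino acid becomes its own token,
# and replace rare amino acids (U, Z, O, B) with X as in the ProtTrans examples.
spaced = [' '.join(re.sub(r"[UZOB]", "X", seq)) for seq in sequences]

# Pad to the longest sequence in the batch.
encoded = tokenizer(spaced, padding=True)
print(encoded['input_ids'][0])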

I just realized that as well, thanks anyway! :slight_smile:
