I’m using the ProtBert models from ProtTrans (https://github.com/agemagician/ProtTrans) for protein embeddings. When I use the BertTokenizer (and the same thing happens with the T5Tokenizer), every one of my sequences gets tokenized to [2, 1, 3]. Does anyone know what’s going wrong?
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
sequences = ['MEGGGKPNSASNSRDDGNSVYPSKAPATGPAAAD', 'MQLKAKEELLRNMELGLIPDQEIRQLIRVE', 'MTVSTSKTPKKNIKYTLTHTLQKWKETLKKITHETLSSI']
tokenizer(sequences)
gives a result like:
{'input_ids': [[2, 1, 3], [2, 1, 3], [2, 1, 3]], 'token_type_ids': [[0, 0, 0], [0, 0, 0], [0, 0, 0]], 'attention_mask': [[1, 1, 1], [1, 1, 1], [1, 1, 1]]}
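For what it’s worth, here is the sanity check I tried (assuming I’m interpreting the IDs correctly): mapping the IDs back to tokens and printing the tokenizer’s special-token IDs. If 2/1/3 turn out to be [CLS]/[UNK]/[SEP], then each whole sequence is apparently collapsing into a single unknown token.

# Map the returned IDs back to their tokens
print(tokenizer.convert_ids_to_tokens([2, 1, 3]))
# Compare against the tokenizer's own special-token IDs
print(tokenizer.cls_token_id, tokenizer.unk_token_id, tokenizer.sep_token_id)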