Hi all,
The following example was taken from the introductory page for the ESM language models:
```python
from transformers import AutoTokenizer, EsmForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = EsmForSequenceClassification.from_pretrained("facebook/esm2_t6_8M_UR50D")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_id = logits.argmax().item()
```
The funny thing is that the ESM (Evolutionary Scale Modeling) models are Transformer-based protein language models, while the input sentence here is natural language ("Hello, my dog is cute").
Does anyone have an idea of how the model can still generate outputs without any warnings, and how the resulting logit values should be interpreted?
The functions are just performing some calculations on the input.
They take some text, convert it into numbers with the tokeniser and do some math in the model and return a bunch of numbers that can be converted back into text.
This process will be carried out irrespective of what text string you pass into it.
But outputs will be meaningful only if the inputs are protein strings because the model has been trained to understand these strings and work on them.
If you pass it random strings the output will likely just be gibberish.
Another way to think about it is that the model won’t complain because it has not been trained to complain.
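To make the point concrete, here is a toy sketch of that pipeline: a hypothetical character-level tokenizer over the 20 amino-acid letters and a stand-in "model" that just emits deterministic pseudo-random logits (none of this is the real ESM tokenizer or weights). Any string, protein or not, flows through without complaint and yields a predicted class.

```python
import random

# Hypothetical character-level vocabulary over the 20 standard amino acids.
VOCAB = {aa: i for i, aa in enumerate("LAGVSERTIDPKQNFYMHWC")}
UNK_ID = len(VOCAB)  # id assigned to any character outside the vocabulary

def tokenize(text):
    """Map each character to its vocabulary id, or UNK_ID if unseen."""
    return [VOCAB.get(ch, UNK_ID) for ch in text]

def model(ids, num_classes=2):
    """Stand-in 'model': deterministic pseudo-random logits per input.
    It does 'some math' on the ids regardless of what they encode."""
    rng = random.Random(sum(ids))
    return [rng.uniform(-1, 1) for _ in range(num_classes)]

for text in ("MKTAYIAKQR", "Hello, my dog is cute"):
    logits = model(tokenize(text))
    predicted = max(range(len(logits)), key=lambda i: logits[i])
    print(text, "->", predicted)  # both inputs happily produce a class id
```

The natural-language sentence produces logits just as readily as the protein string; nothing in the computation checks whether the input was meaningful.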
I see, thank you very much for your clarification.
One more thing I was wondering: the vocabulary of the protein model only contains around 20 different "words" (corresponding to the different amino acids), while NLP texts may draw on vocabularies of thousands of entries. Do you have any idea how the tokenizer performs input encoding in this case?
Kind regards,
It depends on the tokenizer, but in this case it marks every token that is not in the vocabulary as unknown. Here is a workbook on Kaggle to show you.
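A minimal sketch of that fallback behaviour, assuming a hypothetical fixed character-level vocabulary with a reserved `<unk>` token (the real ESM tokenizer also has special tokens like `<cls>` and `<eos>`, omitted here for brevity):

```python
# The ~20 "words" of a protein vocabulary: one-letter amino-acid codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
vocab = {"<unk>": 0}
vocab.update({aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)})

def encode(text):
    """Out-of-vocabulary characters collapse to the <unk> id."""
    return [vocab.get(ch, vocab["<unk>"]) for ch in text]

def pretty(text):
    """Show which characters survive and which become <unk>."""
    return [ch if ch in vocab else "<unk>" for ch in text]

print(pretty("MKTAYI"))  # ['M', 'K', 'T', 'A', 'Y', 'I'] — all valid codes
print(pretty("Hello"))   # ['H', '<unk>', '<unk>', '<unk>', '<unk>']
```

Note how "Hello" is mostly unknowns: only the uppercase "H" happens to coincide with an amino-acid code, which is why a natural-language sentence still encodes to *something* the model can process.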