Hi all,
The following example was taken from the introductory page for the ESM language models:
```python
from transformers import AutoTokenizer, EsmForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = EsmForSequenceClassification.from_pretrained("facebook/esm2_t6_8M_UR50D")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_id = logits.argmax().item()
```
The funny thing is that the ESM (Evolutionary Scale Modeling) models are Transformer-based protein language models, while the input sentence here is natural language ("Hello, my dog is cute").
Does anyone have an idea of how the model can still generate outputs without any warnings, and how the resulting logit values should be interpreted?
The functions are just performing some calculations on the input.
They take some text, convert it into numbers with the tokeniser and do some math in the model and return a bunch of numbers that can be converted back into text.
This process will be carried out irrespective of what text string you pass into it.
But outputs will be meaningful only if the inputs are protein strings because the model has been trained to understand these strings and work on them.
If you pass it random strings the output will likely just be gibberish.
Another way to think about it is that the model won’t complain because it has not been trained to complain.
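To make the point concrete, here is a toy sketch of that pipeline: a hypothetical character-level tokenizer over the 20 amino-acid letters and a stand-in "model" that just emits deterministic pseudo-random logits (none of this is the real ESM tokenizer or weights). Any string, protein or not, flows through without complaint and yields a predicted class.

```python
import random

# Hypothetical character-level vocabulary over the 20 standard amino acids.
VOCAB = {aa: i for i, aa in enumerate("LAGVSERTIDPKQNFYMHWC")}
UNK_ID = len(VOCAB)  # id assigned to any character outside the vocabulary

def tokenize(text):
    """Map each character to its vocabulary id, or UNK_ID if unseen."""
    return [VOCAB.get(ch, UNK_ID) for ch in text]

def model(ids, num_classes=2):
    """Stand-in 'model': deterministic pseudo-random logits per input.
    It does 'some math' on the ids regardless of what they encode."""
    rng = random.Random(sum(ids))
    return [rng.uniform(-1, 1) for _ in range(num_classes)]

for text in ("MKTAYIAKQR", "Hello, my dog is cute"):
    logits = model(tokenize(text))
    predicted = max(range(len(logits)), key=lambda i: logits[i])
    print(text, "->", predicted)  # both inputs happily produce a class id
```

The natural-language sentence produces logits just as readily as the protein string; nothing in the computation checks whether the input was meaningful.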
I see, thank you very much for your clarification.
One more thing I was wondering: the vocabulary of the protein model only contains around 20 different "words" (corresponding to the different amino acids), while NLP texts may draw on vocabularies of thousands of entries. Do you have any idea how the tokenizer performs input encoding in this case?
Kind regards,
It depends on the tokenizer, but in this case it marks every token that is not in the vocabulary as unknown. Here is a workbook on Kaggle to show you.
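A minimal sketch of that fallback behaviour, assuming a hypothetical fixed character-level vocabulary with a reserved `<unk>` token (the real ESM tokenizer also has special tokens like `<cls>` and `<eos>`, omitted here for brevity):

```python
# The ~20 "words" of a protein vocabulary: one-letter amino-acid codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
vocab = {"<unk>": 0}
vocab.update({aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)})

def encode(text):
    """Out-of-vocabulary characters collapse to the <unk> id."""
    return [vocab.get(ch, vocab["<unk>"]) for ch in text]

def pretty(text):
    """Show which characters survive and which become <unk>."""
    return [ch if ch in vocab else "<unk>" for ch in text]

print(pretty("MKTAYI"))  # ['M', 'K', 'T', 'A', 'Y', 'I'] — all valid codes
print(pretty("Hello"))   # ['H', '<unk>', '<unk>', '<unk>', '<unk>']
```

Note how "Hello" is mostly unknowns: only the uppercase "H" happens to coincide with an amino-acid code, which is why a natural-language sentence still encodes to *something* the model can process.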