Hi, I have a few questions regarding tokenizing word/characters/emojis for different huggingface models.
From my understanding, a model performs best at inference only when the tokens of the input sentence are among the tokens that the model's tokenizer was trained on.
My questions are:
1. Is there a way to easily find out whether a particular word/emoji is compatible with the model, i.e. was included during model training? (A rough sketch of the kind of check I have in mind is right after this list.)
2. If a word/emoji was not included during model training, what are the best ways to handle it so that inference gives the best possible output when such words/emojis appear in the input? (For 2., it would be nice if it could be answered in the context of my setup below, if possible.)
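To make question 1 concrete, this is roughly the kind of check I am imagining; I am assuming that looking at the pieces the tokenizer produces, and whether any of them falls back to the unknown token, is the right signal, but I am not sure:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-mnli')

def check_coverage(text):
    # Split the string into the tokenizer's sub-word / byte-level pieces
    tokens = tokenizer.tokenize(text)
    ids = tokenizer.convert_tokens_to_ids(tokens)
    # If any piece maps to the unknown token, the string is outside the vocabulary;
    # byte-level tokenizers like BART's tend to split unseen strings into many
    # small pieces rather than producing an unknown token
    return tokens, tokenizer.unk_token_id in ids

print(check_coverage('happy'))  # common word, I expect a single piece
print(check_coverage('😃'))     # emoji, I expect several byte-level pieces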
My current setup is as follows:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
pre_trained_model = 'facebook/bart-large-mnli'
task = 'zero-shot-classification'
candidate_labels = ['happy', 'sad', 'angry', 'confused']
tokenizer = AutoTokenizer.from_pretrained(pre_trained_model)
model = AutoModelForSequenceClassification.from_pretrained(pre_trained_model)
zero_shot_classifier = pipeline(model=model, tokenizer=tokenizer, task=task)
zero_shot_classifier('today is a good day 😃', candidate_labels=candidate_labels)
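For question 2, one workaround I have considered (not sure whether it is a good idea) is to replace emojis with their textual names before classification, e.g. using the emoji package, so the model sees ordinary words instead of raw emoji characters:

import emoji

text = 'today is a good day 😃'
# Convert the emoji into a textual name (something like 'grinning_face_with_big_eyes')
text_demojized = emoji.demojize(text, delimiters=(' ', ' '))
zero_shot_classifier(text_demojized, candidate_labels=candidate_labels)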
Any help is appreciated