Hi, I have a few questions regarding tokenizing word/characters/emojis for different huggingface models.
From my understanding, a model performs best at inference only when the tokens of the input sentence are among the tokens that the model's tokenizer was trained on.
My questions are:
1. Is there a way to easily find out whether a particular word/emoji is compatible with the model, i.e. was included during model training? (A rough sketch of the kind of check I have in mind is right after this list.)
2. If a word/emoji was not included during model training, what are the best ways to handle it so that inference gives the best possible output when such words/emojis appear in the input? (For 2., it would be nice if it could be answered in the context of my setup below, if possible.)
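To make question 1 concrete, this is roughly the kind of check I am imagining; I am assuming that looking at the pieces the tokenizer produces, and whether any of them falls back to the unknown token, is the right signal, but I am not sure:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-mnli')

def check_coverage(text):
    # Split the string into the tokenizer's sub-word / byte-level pieces
    tokens = tokenizer.tokenize(text)
    ids = tokenizer.convert_tokens_to_ids(tokens)
    # If any piece maps to the unknown token, the string is outside the vocabulary;
    # byte-level tokenizers like BART's tend to split unseen strings into many
    # small pieces rather than producing an unknown token
    return tokens, tokenizer.unk_token_id in ids

print(check_coverage('happy'))  # common word, I expect a single piece
print(check_coverage('😃'))     # emoji, I expect several byte-level pieces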
My current setup is as follows:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
pre_trained_model = 'facebook/bart-large-mnli'
task = 'zero-shot-classification'
candidate_labels = ['happy', 'sad', 'angry', 'confused']
tokenizer = AutoTokenizer.from_pretrained(pre_trained_model)
model = AutoModelForSequenceClassification.from_pretrained(pre_trained_model)
zero_shot_classifier = pipeline(model=model, tokenizer=tokenizer, task=task)
zero_shot_classifier('today is a good day 😃', candidate_labels=candidate_labels)
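For question 2, one workaround I have considered (not sure whether it is a good idea) is to replace emojis with their textual names before classification, e.g. using the emoji package, so the model sees ordinary words instead of raw emoji characters:

import emoji

text = 'today is a good day 😃'
# Convert the emoji into a textual name (something like 'grinning_face_with_big_eyes')
text_demojized = emoji.demojize(text, delimiters=(' ', ' '))
zero_shot_classifier(text_demojized, candidate_labels=candidate_labels)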
Any help is appreciated