Clarifying the use of [UNK] versus [MASK]

sheafyffe · August 9, 2021, 6:59pm

Hello lovely people!

I wanted to get some advice on the “appropriate” way to treat certain tokens for fine-tuned text classification tasks that I am performing. For these tasks, I am using DebertaForSequenceClassification.

Question: If I want to “control” for the occurrence of certain words, is it best to replace them with [MASK], [UNK], or something else?

More Info
I fine-tuned a model using several personality items (my training data), each measuring one of five personality traits (example below).

text	labels
I enjoy being with people.	extraversion
I get irritated easily.	neuroticism
I hang around doing nothing.	conscientiousness
I have frequent mood changes.	neuroticism
I accept apologies easily.	agreeableness

For my unlabelled “testing” data, I asked several people to respond to a series of situational prompts, for instance: “You’re at work in an office building and your office begins to smell of gas. What would you do?” In task 1, I did nothing to the raw text. In task 2, I want to control the tokens that appear in the prompt, essentially making them constant across people, and hopefully without confusing the model too much. Right now my input test set looks something like this (using the example prompt above).

task1_raw_text	task2_raw_text
i would look around to find a possible cause	i [UNK] look around [UNK] find [UNK] [UNK] cause
then i would call my boss or the building maintenance if there is any and explain what just happened and that i called 911	then i [UNK] call my boss [UNK] [UNK] [UNK] maintenance if there [UNK] any [UNK] explain [UNK] just happened [UNK] that i called 911
first of all i would check where the gas leakage is coming from	first [UNK] all i [UNK] check where [UNK] [UNK] leakage [UNK] coming [UNK]
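
For reference, here is a minimal sketch of how a task 2 column like the one above could be produced. It assumes a hand-picked set of words to hold constant. The names `CONTROLLED_WORDS` and `mask_controlled_words` are hypothetical, and the word list is inferred only from the first example row, so treat it as an illustration rather than the exact procedure used.

```python
# Hypothetical word list, inferred from the first example row only.
CONTROLLED_WORDS = {"would", "to", "a", "possible"}


def mask_controlled_words(text, controlled_words, placeholder="[UNK]"):
    """Replace each whitespace-separated word found in controlled_words
    with the placeholder string, leaving all other words untouched."""
    words = text.split()
    return " ".join(
        placeholder if word.lower() in controlled_words else word
        for word in words
    )


masked = mask_controlled_words(
    "i would look around to find a possible cause", CONTROLLED_WORDS
)
print(masked)  # i [UNK] look around [UNK] find [UNK] [UNK] cause
```

Swapping the `placeholder` argument (for example to `"[MASK]"`) would make it easy to compare the two options on the same text.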

Follow-up question: This may be silly, but if special tokens (e.g., ‘[SEP]’, ‘[UNK]’, ‘[CLS]’) appear in the raw text (prior to tokenization), will the tokenizer treat them as special tokens or split them up as literal text?

This is for a research manuscript, and I’m trying to make the case that “the prompt may (drastically) affect how individuals demonstrate their personality through text”. Any advice would be greatly appreciated!

Thanks for being awesome.
