Hello lovely people!
I wanted to get some advice on the “appropriate” way to treat certain tokens for fine-tuned text classification tasks that I am performing. For these tasks, I am using
Question: If I want to “control” for the occurrence of certain words, is it best to use a
[UNK], or something else?
I fine-tuned a model using several personality items (my training data), each measuring one of five personality traits (example below).
| Item | Trait |
|---|---|
| I enjoy being with people. | extraversion |
| I get irritated easily. | neuroticism |
| I hang around doing nothing. | conscientiousness |
| I have frequent mood changes. | neuroticism |
| I accept apologies easily. | agreeableness |
For my unlabelled “testing” data, I asked several people to respond to a series of situational prompts, for instance: “You’re at work in an office building and your office begins to smell of gas. What would you do?” In task 1, I did nothing to the raw text. In task 2, I want to control for the tokens that appear in the prompt, essentially holding them constant across people, hopefully without confusing the model too much. Right now my input test set looks something like this for the example prompt above (raw response on the left, masked version on the right).
| Raw response | Masked response |
|---|---|
| i would look around to find a possible cause | i [UNK] look around [UNK] find [UNK] [UNK] cause |
| then i would call my boss or the building maintenance if there is any and explain what just happened and that i called 911 | then i [UNK] call my boss [UNK] [UNK] [UNK] maintenance if there [UNK] any [UNK] explain [UNK] just happened [UNK] that i called 911 |
| first of all i would check where the gas leakage is coming from | first [UNK] all i [UNK] check where [UNK] [UNK] leakage [UNK] coming [UNK] |
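
For concreteness, here is a minimal sketch of the kind of masking step that produces the second column. It is plain Python and makes several assumptions: the name `mask_controlled_words` and the `CONTROL_WORDS` set are hypothetical, the four words in the set were picked only so that the output reproduces the first example row, and words are split on whitespace rather than with the model's tokenizer.

```python
# Hypothetical control vocabulary: the words to hold constant across people.
# These four were chosen only to reproduce the first example row.
CONTROL_WORDS = {"would", "to", "a", "possible"}


def mask_controlled_words(text, control_words=CONTROL_WORDS, placeholder="[UNK]"):
    """Replace each whitespace-delimited word found in control_words
    with the placeholder string; all other words pass through unchanged."""
    masked = []
    for word in text.lower().split():
        # Strip surrounding punctuation before the lookup so that
        # "would," still matches "would" (the punctuation is dropped
        # together with the masked word).
        core = word.strip(".,;:!?\"'")
        masked.append(placeholder if core in control_words else word)
    return " ".join(masked)


print(mask_controlled_words("i would look around to find a possible cause"))
# prints: i [UNK] look around [UNK] find [UNK] [UNK] cause
```

Because the placeholder is a parameter, the same function can generate alternative versions of the test set (a different placeholder string, or an empty string to drop the words) so the candidate treatments can be compared directly. Which treatment is most appropriate is still the open question here.
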
Follow-up question: This may be silly, but if special tokens (e.g., ‘[SEP]’, ‘[UNK]’, ‘[CLS]’) appear in the raw text (prior to tokenization), will they be mapped to the corresponding special tokens or tokenized as literal text?
This is for a research manuscript, and I’m trying to make the case that “the prompt may (drastically) affect how individuals demonstrate their personality through text”. Any advice would be greatly appreciated!
Thanks for being awesome.