Hello. I've been running experiments comparing a Hugging Face Transformer ("cardiffnlp/twitter-roberta-base-sentiment-latest") against OpenAI's API on text classification/sentiment analysis. Because of the OpenAI cost, I've been working with very small sample sets. The idea is to use the intersection of the two models' inferences to effectively "annotate" the text spans. Out of 100 samples each for the Positive and Negative classes, I achieved F1 scores of roughly 74% and 68% respectively (note that the majority class, Neutral, always has a high intersection rate).
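For context, here is a minimal sketch of what I mean by the intersection step. The `pipeline` call is the model I'm using; the example texts and OpenAI labels are toy placeholders, and matching on lowercased labels is my assumption about how to normalize the two models' outputs:

```python
from transformers import pipeline

# The same checkpoint used in the intersection study.
clf = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

def agreed_annotations(texts, openai_labels):
    """Keep only the samples where both models predict the same label."""
    hf_labels = [pred["label"].lower() for pred in clf(texts)]
    return [
        (text, label)
        for text, label, oa in zip(texts, hf_labels, openai_labels)
        if label == oa.lower()
    ]

# Toy usage: only the pairs where the two models agree survive.
gold = agreed_annotations(
    ["I love this phone", "Worst purchase ever", "It arrived on Tuesday"],
    ["positive", "negative", "positive"],
)
```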
My questions now:

- If I want to fine-tune the model with my "gold standard" datasets, will roughly 70 samples per class be enough to fine-tune effectively, or would I need a larger number of utterances in my gold-standard dataset?
- Could I use the same transformer I used in the inference intersection study? I think the answer is yes, but I want to make sure (a sketch of what I have in mind follows this list).
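To make that second question concrete, this is roughly what I picture: reloading the same checkpoint and fine-tuning it on the agreed annotations. The gold texts/labels here are toy placeholders and the hyperparameters are untuned guesses, not values I've validated:

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

# Toy stand-ins for the agreed annotations; label ids follow the
# checkpoint's own mapping (0=negative, 1=neutral, 2=positive).
gold_texts = ["I love this phone", "Worst purchase ever", "It arrived on Tuesday"]
gold_labels = [2, 0, 1]

ds = Dataset.from_dict({"text": gold_texts, "label": gold_labels})
ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)
split = ds.train_test_split(test_size=0.2, seed=42)

args = TrainingArguments(
    output_dir="roberta-sentiment-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-5,  # kept small since the dataset is tiny
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
print(trainer.evaluate())
```

My understanding is that because the checkpoint already has a 3-way sentiment head, I can fine-tune it directly without reinitializing the classifier, but please correct me if that's wrong.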
Everyone's thoughts and critiques of my approach are appreciated. I'm a new ML researcher, so I have much to learn.