Fine-tuning a pretrained model - how many data samples are needed for effectiveness?

Hello. I've been running experiments comparing the performance of a Transformer from Hugging Face ("cardiffnlp/twitter-roberta-base-sentiment-latest") and OpenAI's APIs on a text classification/sentiment analysis task. Because of the OpenAI API cost, I've been running very small sample sets. The idea is to use the intersection of the two models' predictions to 'annotate' the text spans. Out of 100 samples each for the Positive and Negative classes, I have achieved F1 scores of roughly 74% and 68% respectively (note: the majority class, Neutral, always has a high intersection rate). My questions now:
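For clarity, here is roughly what I mean by intersection-based annotation, as a minimal sketch in plain Python (the texts, labels, and predictions below are made-up toy data, and `intersect_annotations` is just an illustrative helper, not real code from my pipeline):

```python
# Sketch of intersection-based annotation: keep only spans where both
# models agree, and treat the agreed label as the annotation.
from collections import Counter

def intersect_annotations(roberta_preds, openai_preds, texts):
    """Return (text, label) pairs for spans where both models agree."""
    return [
        (text, a)
        for text, a, b in zip(texts, roberta_preds, openai_preds)
        if a == b
    ]

texts = ["great phone", "battery died fast", "it arrived", "love it", "meh"]
roberta_preds = ["positive", "negative", "neutral", "positive", "negative"]
openai_preds  = ["positive", "negative", "neutral", "positive", "neutral"]

gold = intersect_annotations(roberta_preds, openai_preds, texts)
print(len(gold))  # 4 of the 5 toy spans agree
print(Counter(label for _, label in gold))
```

The disagreeing spans are simply dropped, which is part of why I am unsure whether the resulting gold set will be large enough.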

  1. If I want to fine-tune the model with my 'gold standard' datasets, will roughly 70 samples per class be enough for effective fine-tuning, or would I need a larger number of utterances in my gold-standard dataset?

  2. Could I use the same transformer that I used for the inference intersection study? I think the answer is yes, but I just want to make sure.
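On question 1, part of my worry is how small the held-out split gets at this scale. A quick back-of-envelope in plain Python (the 80/20 split ratio here is an assumed convention, not something I have committed to):

```python
# Rough split-size arithmetic for a ~70-samples-per-class gold set.
# The 80/20 train/validation ratio is an assumption for illustration.
samples_per_class = 70
train_frac = 0.8

train_per_class = int(samples_per_class * train_frac)  # 56 for training
val_per_class = samples_per_class - train_per_class    # 14 for validation

print(train_per_class, val_per_class)
# With only 14 validation examples per class, a single flipped prediction
# moves per-class accuracy by roughly 7 points, so metrics will be noisy.
```

That noisiness is why I suspect I may need more utterances (or cross-validation) even if 70 per class is technically enough to run fine-tuning.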

Any thoughts on and critiques of my approach are appreciated. I'm a new ML researcher, so I have much to learn.