Fine-tuning a pretrained model - how many data samples are needed for effectiveness?

Hello. I’ve been running experiments comparing the performance of a Transformer from Hugging Face (“cardiffnlp/twitter-roberta-base-sentiment-latest”) against OpenAI’s APIs on a text classification/sentiment analysis task. Because of OpenAI API costs, I’ve been working with very small sample sets. The idea is to use the intersection of the two models’ predictions — the samples where both assign the same label — to effectively ‘annotate’ the text spans. On 100 samples each for the Positive and Negative classes, I get F1 scores of roughly 74% and 68%, respectively (the majority Neutral class always has a high intersection rate). My questions now:
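To make the annotation step concrete, here is a minimal sketch of what I mean by using the intersection: keep only the samples where both models agree, and treat the shared label as the gold annotation. The texts, label names, and predictions below are made up purely for illustration.

```python
def intersect_annotations(texts, hf_labels, openai_labels):
    """Return (text, label) pairs where both models assigned the same label."""
    gold = []
    for text, a, b in zip(texts, hf_labels, openai_labels):
        if a == b:
            gold.append((text, a))
    return gold

# Toy example (labels are hypothetical, not real model output):
texts = ["great product", "terrible service", "arrived on time"]
hf_preds = ["positive", "negative", "neutral"]
openai_preds = ["positive", "neutral", "neutral"]

gold = intersect_annotations(texts, hf_preds, openai_preds)
# Only the first and third samples agree, so two pairs survive.
```

The F1 numbers above are computed on how often this agreement set matches up per class.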

  1. If I want to fine-tune the model I am using on my ‘gold standard’ dataset, would roughly 70 samples per class be enough for fine-tuning to be effective? Or would I need a substantially larger number of utterances in my gold standard dataset?

  2. Could I fine-tune the same Transformer I used in the inference-intersection study? I think the answer is yes, but I want to make sure.
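For context on question 1, here is a sketch of how I would set up such a small dataset before fine-tuning: with only ~70 gold samples per class, I would still hold out a stratified validation split so overfitting is detectable. The label names, sizes, and 20% split are assumptions for illustration, not a definitive recipe.

```python
import random

def stratified_split(samples, val_fraction=0.2, seed=42):
    """Split a dict of label -> texts into stratified (train, val) lists of (text, label)."""
    rng = random.Random(seed)
    train, val = [], []
    for label, texts in samples.items():
        texts = texts[:]  # copy so the caller's lists are not mutated
        rng.shuffle(texts)
        n_val = max(1, int(len(texts) * val_fraction))  # at least 1 per class
        val.extend((t, label) for t in texts[:n_val])
        train.extend((t, label) for t in texts[n_val:])
    return train, val

# Hypothetical gold set: 70 samples per class, as in my question.
data = {
    "positive": [f"pos sample {i}" for i in range(70)],
    "negative": [f"neg sample {i}" for i in range(70)],
    "neutral":  [f"neu sample {i}" for i in range(70)],
}
train, val = stratified_split(data)
# 20% held out per class -> 14 validation + 56 training samples per class.
```

Splitting per class (rather than shuffling the pooled data) keeps all three classes represented in the validation set even at these small sizes.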

Any thoughts or critiques of my approach are appreciated. I’m a new ML researcher, so I have much to learn.