Fine-tuning a pretrained model - how many data samples are needed for effectiveness?

Hello. I've been running experiments comparing the performance of a Transformer from Hugging Face ("cardiffnlp/twitter-roberta-base-sentiment-latest") against OpenAI's API on a text classification/sentiment analysis task. Because of the OpenAI cost, I've been using very small sample sets. The idea is to use the intersection of the two models' predictions, i.e. the samples where they agree, to 'annotate' the text spans automatically. Out of 100 samples each for the Positive and Negative classes, I achieved agreement F1 scores of roughly 74% and 68% respectively (the majority class, Neutral, always has a high intersection rate). My questions now:
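For clarity, the intersection/'annotation' step I'm describing is essentially the following (a minimal sketch with made-up predictions; the function and label names are just illustrative):

```python
from typing import List, Tuple

def intersect_annotations(
    texts: List[str],
    labels_a: List[str],
    labels_b: List[str],
) -> List[Tuple[str, str]]:
    """Keep only the samples where both models agree on the label.

    labels_a / labels_b are the per-sample predictions from the two
    models (e.g. the RoBERTa checkpoint and the OpenAI API).
    """
    gold = []
    for text, a, b in zip(texts, labels_a, labels_b):
        if a == b:
            gold.append((text, a))
    return gold

# Toy example:
texts = ["great phone", "meh", "terrible service"]
roberta_preds = ["positive", "neutral", "negative"]
openai_preds = ["positive", "positive", "negative"]
gold = intersect_annotations(texts, roberta_preds, openai_preds)
print(gold)
# Samples 0 and 2 survive; sample 1 is dropped because the models disagree.
```

The agreed-upon pairs then become the 'gold standard' dataset mentioned below.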

  1. If I want to fine-tune the model I'm using on my 'gold standard' dataset, will roughly 70 samples per class be enough to fine-tune effectively? Or would I need a larger number of utterances in the gold standard dataset?

  2. Could I use the same transformer that I used in the inference-intersection study? I think the answer is yes, but I just want to make sure.

Any thoughts and critiques on my approach are appreciated. I'm a new ML researcher, so I have much to learn.
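For context on question 1, this is roughly how I plan to hold out a validation set from the gold-standard data before fine-tuning. With ~70 samples per class, even a modest 20% stratified split leaves only about 56 training samples per class, which is part of why I'm worried about dataset size (a sketch with dummy data; the split helper is my own, not from a library):

```python
import random
from collections import defaultdict

def stratified_split(samples, seed=0, val_frac=0.2):
    """Split (text, label) pairs into train/val sets, keeping class balance.

    samples: list of (text, label) tuples, e.g. the intersection-annotated
    gold standard data.
    """
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))
    rng = random.Random(seed)
    train, val = [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        n_val = max(1, int(len(items) * val_frac))
        val.extend(items[:n_val])
        train.extend(items[n_val:])
    return train, val

# Dummy gold set: 70 samples for each of the three classes.
gold = [(f"text_{label}_{i}", label)
        for label in ("positive", "negative", "neutral")
        for i in range(70)]
train, val = stratified_split(gold)
print(len(train), len(val))  # 168 42 -> only 56 training samples per class
```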