Transformers for small datasets?

Many transformer models, such as BERT and GPT, perform well on large datasets, but what about fine-tuning them on smaller, highly specialized datasets? How can one determine the optimal learning rate or batch size in such cases?
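One common starting point is a small grid search over learning rates and batch sizes, scored on a held-out validation split. Below is a minimal sketch of that idea in NumPy; it uses a toy logistic-regression model and synthetic data as stand-ins for a real transformer and dataset (all names, sizes, and grid values here are illustrative assumptions, not anything from this thread):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a "small, specialized dataset": 200 examples, 10 features.
X = rng.normal(size=(200, 10))
true_w = rng.normal(size=10)
y = (X @ true_w + 0.1 * rng.normal(size=200) > 0).astype(float)

# Hold out a validation split before searching over hyperparameters.
X_tr, y_tr, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

def train_eval(lr, batch_size, epochs=30):
    """Minibatch SGD on logistic regression; returns validation accuracy."""
    w = np.zeros(X.shape[1])
    n = len(X_tr)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X_tr[idx], y_tr[idx]
            p = 1.0 / (1.0 + np.exp(-(Xb @ w)))   # sigmoid predictions
            w -= lr * Xb.T @ (p - yb) / len(idx)  # gradient step
    val_pred = (X_val @ w > 0).astype(float)
    return (val_pred == y_val).mean()

# Small grid over learning rates and batch sizes.
results = {
    (lr, bs): train_eval(lr, bs)
    for lr in (1e-3, 1e-2, 1e-1)
    for bs in (8, 16, 32)
}
best = max(results, key=results.get)
print("best (lr, batch_size):", best, "val acc:", round(results[best], 3))
```

With a real transformer the same loop structure applies, just with the model's own training step in place of the SGD update; for small datasets, smaller learning rates (e.g. in the 1e-5 to 5e-5 range typically used for BERT-style fine-tuning) and early stopping on the validation metric are the usual levers.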


Following this thread with curiosity; great question.

Is there anyone here who can offer guidance on this?

I’m here, but it’s too technical for me to answer.
It’s the kind of subject someone might write a paper or an article on.