I have a fairly specific corpus, and I wanted to try an approach where I do everything from scratch (well, almost!). Specifically, I was thinking of:
1. Training a language model from scratch on my own corpus, including creating my own tokenizer. Hugging Face provides two Colab notebooks for that (see Google Colab for the LM).
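For the tokenizer part of step 1, a minimal sketch with the `tokenizers` library could look like the following. The in-memory corpus and the vocab size are placeholders for your real data (you would normally point `tokenizer.train(files=[...])` at your corpus files and use a much larger vocabulary):

```python
from tokenizers import ByteLevelBPETokenizer

# Placeholder in-memory corpus; in practice, pass your real corpus
# files via tokenizer.train(files=[...]) instead.
corpus = [
    "I have a quite specific corpus.",
    "I want to train a tokenizer and a language model on it from scratch.",
]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    corpus,
    vocab_size=1000,   # placeholder; something like 30_000-50_000 is typical
    min_frequency=1,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

enc = tokenizer.encode("a specific corpus")
print(enc.tokens)
```

You would then save it with `tokenizer.save_model(...)` into the same directory as the pretrained model checkpoint, so that `AutoTokenizer.from_pretrained("my_language_model")` can later pick it up.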
2. Then simply loading the language model created in step 1 and fine-tuning it for text classification with the usual imports:
```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("my_language_model")
# do some classification
# save the final model
```
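To sanity-check that the step-2 wiring works before touching a real checkpoint, here is a minimal sketch with a tiny randomly initialized model standing in for `my_language_model` (all config values are placeholders, and the model name is hypothetical; in the real workflow you would call `from_pretrained` as above, passing `num_labels`, and transformers will warn you that the classification head is newly initialized):

```python
import torch
from transformers import RobertaConfig, RobertaForSequenceClassification

# Tiny random model standing in for the pretrained LM; the real
# workflow would load the step-1 checkpoint instead:
#   model = AutoModelForSequenceClassification.from_pretrained(
#       "my_language_model", num_labels=2)
config = RobertaConfig(
    vocab_size=1000,
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
    num_labels=2,
)
model = RobertaForSequenceClassification(config)

# One forward pass with dummy token ids and a label; with labels
# supplied, the output also carries a cross-entropy loss.
input_ids = torch.randint(0, config.vocab_size, (1, 8))
out = model(input_ids=input_ids, labels=torch.tensor([0]))
print(out.logits.shape)   # torch.Size([1, 2])
```

The actual fine-tuning loop on top of this can be a standard PyTorch loop or the `Trainer` API; conceptually, the only thing that changes versus pretraining is that the LM head is swapped for a fresh classification head.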
Does that make sense? Are these the right conceptual steps?