I recently released grouphug - a package built on transformers/datasets and optimized for training on multiple datasets/dataframes at once, with each containing an arbitrary subset of the tasks.
The need for this came from wanting to predict many closely related things like message topic, sentiment, toxicity, etc, with the inference speed of a single model, and with better accuracy than training separate models.
I have also found that co-training on a masked language modelling task results in models which generalize very well and are much more resistant to overfitting.
Even for single-task modelling, the classification head is a good deal more powerful than the usual default, and the dataset formatter may be useful for quickly turning your dataframes into the format the model expects.
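To give an idea of what this looks like in practice, here is a minimal sketch of a two-dataset, multi-task setup. The class names (DatasetFormatter, ClassificationHeadConfig, LMHeadConfig, AutoMultiTaskModel, MultiTaskTrainer) follow the package's README, but treat the exact signatures and the tiny example dataframes as illustrative rather than canonical:

```python
import pandas as pd
from transformers import AutoTokenizer, TrainingArguments

from grouphug import AutoMultiTaskModel, ClassificationHeadConfig, DatasetFormatter, LMHeadConfig, MultiTaskTrainer

# two dataframes, each containing a different subset of the tasks
topic_data = pd.DataFrame({"text": ["quarterly report attached", "cheap pills, click now"],
                           "topic": ["work", "spam"]})
both_data = pd.DataFrame({"text": ["yay :)", "booo!"],
                          "topic": ["chat", "chat"],
                          "sentiment": ["pos", "neg"]})

base_model = "prajjwal1/bert-tiny"
tokenizer = AutoTokenizer.from_pretrained(base_model)

# tokenize the text and encode the label columns
formatter = DatasetFormatter().tokenize().encode("topic").encode("sentiment")
data = formatter.apply({"topics": topic_data, "both": both_data}, tokenizer=tokenizer, test_size=0.5)

# one classification head per task, plus a low-weight MLM head for regularization
head_configs = [
    LMHeadConfig(weight=0.1),
    ClassificationHeadConfig.from_data(data, "topic"),
    ClassificationHeadConfig.from_data(data, "sentiment"),
]

model = AutoMultiTaskModel.from_pretrained(base_model, head_configs, formatter=formatter, tokenizer=tokenizer)
trainer = MultiTaskTrainer(
    model=model,
    tokenizer=tokenizer,
    train_data=data[:, "train"],
    eval_data=data[:, "test"],
    args=TrainingArguments(output_dir="output"),
)
trainer.train()
```

Each dataset only contributes loss for the heads whose labels it actually has, which is what lets you mix dataframes covering different task subsets in a single training run.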
Would love to hear if this is useful for anyone else, and any suggestions you have!