Hugging Face has great support for hyperparameter optimization. However, it is tied to a Trainer, which means that a masked-language-model pretraining run would be optimized separately from fine-tuning (in my case, classification). As a result, the best model with regard to the pretraining objective is fixed first, and only a subset of training parameters can then be optimized during fine-tuning (a greedy search). The implicit assumption is that the best parameters for the pretraining task (e.g. the number of hidden layers) are also the best for the fine-tuning task.
By contrast, I would like to optimize hyperparameters end-to-end (each trial would run both pretraining and fine-tuning), with the metric that selects the best model architecture coming from performance on the classification task. This can be achieved with custom code, but I was wondering whether this use case is appealing enough to the average user to be considered as a potential new feature by the Hugging Face team?
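To make the request concrete, here is a minimal sketch of the end-to-end loop I have in mind, using plain random search and stub `pretrain`/`finetune` functions (both are hypothetical stand-ins; in practice each would wrap a Trainer run, one for MLM pretraining and one for classification, and the search backend could just as well be Optuna or Ray Tune):

```python
import random

def pretrain(hparams):
    # Stand-in for MLM pretraining: would build a model config from hparams
    # (e.g. num_hidden_layers), train, and return a checkpoint reference.
    return f"ckpt-{hparams['num_hidden_layers']}layers"

def finetune(checkpoint, hparams):
    # Stand-in for classification fine-tuning: would load the pretrained
    # checkpoint, fine-tune, and return a validation metric. Here a toy
    # surrogate score is returned so the sketch is self-contained.
    lr_penalty = 0.05 * abs(hparams["learning_rate"] - 3e-5) / 3e-5
    return 0.8 + 0.01 * hparams["num_hidden_layers"] - lr_penalty

def end_to_end_search(n_trials=10, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        hparams = {
            # Architecture choices are shared by both stages, which is
            # exactly what per-stage HPO cannot explore jointly.
            "num_hidden_layers": rng.choice([4, 6, 8, 12]),
            "learning_rate": rng.choice([1e-5, 3e-5, 5e-5]),
        }
        checkpoint = pretrain(hparams)            # stage 1: pretraining
        score = finetune(checkpoint, hparams)     # stage 2: classification
        if best is None or score > best[0]:       # select on the downstream metric
            best = (score, hparams)
    return best

best_score, best_hparams = end_to_end_search()
```

The key point is that model selection happens only after stage 2, so architecture hyperparameters are judged by classification performance rather than by the pretraining loss.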
Thanks for your consideration.