I am trying to fine-tune BERT for token classification in a low-resource setting, where the goal is to use as few labeled samples as possible (the iterative annotation setup below is usually called active learning). The pipeline is as follows (sketched in pseudocode after the list):
- Step 1: Train on the initial labeled training set
- Step 2: Repeat until the annotation budget is reached or performance is good enough:
  - Add newly annotated samples to the training set based on some selection criterion
  - Retrain the model, continuing from the previously saved checkpoint
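In code, the loop I have in mind looks roughly like this (pure pseudocode: `train`, `select_samples`, `annotate`, `budget_remaining`, and `good_enough` are all placeholders for my own code, not library calls):

```python
# Pseudocode of the intended pipeline; every function is a placeholder.
model = train(model, labeled_set)                     # Step 1: initial training

while budget_remaining() and not good_enough(model):  # Step 2
    batch = select_samples(unlabeled_pool)            # acquisition criterion
    labeled_set = labeled_set + annotate(batch)       # add newly annotated samples
    model = train(model, labeled_set)                 # should continue from the last checkpoint
```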
I have tried the following two implementations:

- Initializing a new `Trainer(...)` every time in Step 2 (sketched below). However, this causes the model to be fine-tuned from scratch in every round, which seems incorrect.
- Reusing the single `Trainer(...)` that is initialized in Step 1. However, from the doc, there seems to be no way to update `train_dataset` after the class is initialized.
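To make implementation 1 concrete, here is roughly what my loop looks like. This is a minimal sketch: the model name, label count, `num_rounds`, and the `grow_training_set` helper are placeholders for my actual setup, not real library functions.

```python
from transformers import (
    AutoModelForTokenClassification,
    Trainer,
    TrainingArguments,
)

num_rounds = 5       # placeholder annotation budget
train_dataset = ...  # initial labeled set (a datasets.Dataset in my code)

def grow_training_set(dataset):
    """Placeholder: add newly annotated samples chosen by my criterion."""
    ...

for rnd in range(num_rounds):
    train_dataset = grow_training_set(train_dataset)

    # Implementation 1: re-initialize the model and Trainer every round
    model = AutoModelForTokenClassification.from_pretrained(
        "bert-base-cased", num_labels=9  # placeholders for my label set
    )
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=f"checkpoints/round-{rnd}"),
        train_dataset=train_dataset,
    )
    # This starts from the pretrained BERT weights again instead of
    # continuing from the checkpoint saved in the previous round.
    trainer.train()
```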
Could someone help me resolve these two issues? Thank you for any input!