Hi, I was wondering if anyone could offer insight into the most effective/efficient way of implementing finetuning with the HuggingFace Trainer while incorporating two distinct datasets. Basically, I'm looking for a setup where the finetuning process goes over two distinct sets of text such that each epoch is guaranteed to traverse the entirety of both. Moreover, the implementation should allow a distinct batch size for each set (e.g. 32 and 128).
Override the Trainer.get_train_dataloader() method to return a custom iterator that wraps two different dataloaders, one for each dataset. Iterating over both sets each epoch lets you handle the batch-size difference… This gives you a lot of flexibility while still building on the library code. Then you just extend the Trainer class with a multi-dataset trainer class.
@Mdrnfox Hmm, I'm still confused as to how the two dataloaders would be taken into account. For instance, compute_loss() takes a single inputs parameter. I assume it would have to be overridden to accept two inputs?
Moreover, wouldn't the function that calls compute_loss() (training_step()) then have to take two inputs as well?
By design, compute_loss() is called with a single inputs dict per batch. inputs is one batch at a time from the dataloader returned by get_train_dataloader(). So each call to training_step() processes exactly one batch and then calls compute_loss() once; no second parameter is needed. Something like this:
from torch.utils.data import DataLoader
from transformers import Trainer

class MultiDatasetTrainer(Trainer):
    def __init__(self, dataset_a=None, dataset_b=None, bs_a=32, bs_b=128, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.dataset_a = dataset_a
        self.dataset_b = dataset_b
        self.bs_a = bs_a
        self.bs_b = bs_b

    def get_train_dataloader(self):
        # Note: this bypasses the sampler/accelerator prep the stock Trainer
        # applies, so treat it as a single-device sketch.
        loader_a = DataLoader(
            self.dataset_a,
            batch_size=self.bs_a,
            shuffle=True,
            collate_fn=self.data_collator,
        )
        loader_b = DataLoader(
            self.dataset_b,
            batch_size=self.bs_b,
            shuffle=True,
            collate_fn=self.data_collator,
        )
        # You'll need to write CombinedLoader (or something like it) yourself;
        # see the sketch below. It should run through all of loader_a and then
        # all of loader_b each epoch, tagging every batch with its source.
        return CombinedLoader(loader_a, loader_b, mode="sequential")

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # **kwargs absorbs extra arguments newer transformers versions pass
        # (e.g. num_items_in_batch)
        source = inputs.pop("source")  # tag added by CombinedLoader
        outputs = model(**inputs)
        loss = outputs.loss
        # Optional: weight dataset B's loss (0.5 is just a placeholder value)
        if source == "B":
            loss = loss * 0.5
        return (loss, outputs) if return_outputs else loss
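For completeness, here's a minimal sketch of what that CombinedLoader could look like. To be clear, this isn't a library class; the name, the mode argument, and the "source" tagging are just the conventions the snippet above assumes:

class CombinedLoader:
    """Runs through loader_a in full, then loader_b, once per epoch,
    tagging each batch with the dataset it came from."""

    def __init__(self, loader_a, loader_b, mode="sequential"):
        assert mode == "sequential", "only the sequential mode is sketched here"
        self.loader_a = loader_a
        self.loader_b = loader_b

    def __len__(self):
        # Trainer calls len() on the dataloader to compute steps per epoch
        return len(self.loader_a) + len(self.loader_b)

    def __iter__(self):
        for batch in self.loader_a:
            batch["source"] = "A"  # consumed by compute_loss above
            yield batch
        for batch in self.loader_b:
            batch["source"] = "B"
            yield batch

Then you'd construct MultiDatasetTrainer like a normal Trainer (passing dataset_a/dataset_b instead of train_dataset), and trainer.train() would treat one epoch as a full pass over both sets. If you'd rather interleave A and B batches than exhaust A first, only __iter__ needs to change.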