Implementation of Two Distinct Datasets with HuggingFace Trainer Module

Hi, I was wondering if anyone could provide any insight into the most effective/efficient way of implementing finetuning with the HuggingFace Trainer module while incorporating two distinct datasets. Basically, I am looking for a way to have the finetuning process go over two distinct sets of text such that each epoch is guaranteed to traverse the entirety of both. Moreover, the implementation should allow a distinct batch size for each set (e.g. 32 and 128).

If it is any help, the finetuning process I am trying to implement is called “Unsupervised Data Augmentation”, which is described here: [1904.12848] Unsupervised Data Augmentation for Consistency Training.


You could try this I think…

Override the Trainer.get_train_dataloader() method to return a custom iterator that uses two different dataloaders, one for each dataset. Iterate over each set per epoch, and you can handle the batch size difference yourself. This gives you a lot of flexibility while still building on the library code. Then you would just extend the Trainer class with a multi-dataset trainer class.
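
Roughly, that custom iterator could be as small as this (the class name is made up, and this is a plain single-process sketch that skips the accelerator/distributed handling the stock get_train_dataloader() does):

    class TwoLoaderIterable:
        """Yields every batch of loader_a, then every batch of loader_b, once per epoch."""

        def __init__(self, loader_a, loader_b):
            self.loader_a = loader_a
            self.loader_b = loader_b

        def __iter__(self):
            yield from self.loader_a  # all of dataset A first...
            yield from self.loader_b  # ...then all of dataset B

        def __len__(self):
            # Trainer calls len() on the train dataloader to work out how many steps make an epoch
            return len(self.loader_a) + len(self.loader_b)

Your overridden get_train_dataloader() would then build the two DataLoader objects, each with its own batch size, and return TwoLoaderIterable(loader_a, loader_b).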



Would I have to override the _inner_training_loop() method in order to iterate over each set per epoch? Or is there another way to do this?


Only if you want to customize the update per gradient step, use different optimizers, or conditionally skip batches.

You’d do something like this after you implement the extended class

trainer = MultiDatasetTrainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    dataset_a=dataset1,
    dataset_b=dataset2,
    bs_a=32,
    bs_b=128,
    train_dataset=None,  # not used: the overridden get_train_dataloader() builds its own loaders
)
trainer.train()


@Mdrnfox Hmm, I'm still confused as to how the two dataloaders would be taken into account. For instance, the compute_loss function takes a single inputs parameter. I assume that this must be overridden to take two input parameters?

Moreover, that would mean that the function that calls the compute_loss function (training_step()) would have to take two inputs as well, no?


By design, compute_loss() is called with a single inputs dict per batch. inputs is whatever your train dataloader yields, one batch at a time. So only one batch is processed per call to training_step(), which then calls compute_loss().

from torch.utils.data import DataLoader
from transformers import Trainer


class MultiDatasetTrainer(Trainer):
    def __init__(self, dataset_a=None, dataset_b=None, bs_a=32, bs_b=128, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.dataset_a = dataset_a
        self.dataset_b = dataset_b
        self.bs_a = bs_a
        self.bs_b = bs_b

    def get_train_dataloader(self):
        loader_a = DataLoader(
            self.dataset_a,
            batch_size=self.bs_a,
            shuffle=True,
            collate_fn=self.data_collator,
        )
        loader_b = DataLoader(
            self.dataset_b,
            batch_size=self.bs_b,
            shuffle=True,
            collate_fn=self.data_collator,
        )
        # You'll need to make this or something like it (see the sketch below)
        return CombinedLoader(loader_a, loader_b, mode="sequential")

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # **kwargs absorbs extras (e.g. num_items_in_batch) that newer Trainer versions pass in.
        # The combined loader tags each batch with the dataset it came from.
        source = inputs.pop("source")

        outputs = model(**inputs)
        loss = outputs.loss

        # Optional: weight dataset B less (0.5 is just an example value)
        if source == "B":
            loss = loss * 0.5

        return (loss, outputs) if return_outputs else loss
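
Here is a minimal hand-rolled sketch of that CombinedLoader (single-process, no distributed-sampler handling; the class name and mode argument just mirror the call above). In "sequential" mode it exhausts loader_a and then loader_b once per epoch, and it tags every batch with a "source" key so compute_loss() can tell the two apart. PyTorch Lightning also ships a CombinedLoader utility, but a few lines of your own keep the batch format under your control:

    class CombinedLoader:
        """Sequentially exhausts two dataloaders each epoch and labels their batches."""

        def __init__(self, loader_a, loader_b, mode="sequential"):
            if mode != "sequential":
                raise NotImplementedError("Only the sequential mode is sketched here.")
            self.loader_a = loader_a
            self.loader_b = loader_b

        def __iter__(self):
            for batch in self.loader_a:
                batch["source"] = "A"  # popped back out by compute_loss() before the forward pass
                yield batch
            for batch in self.loader_b:
                batch["source"] = "B"
                yield batch

        def __len__(self):
            # Trainer uses len() to work out steps per epoch, scheduling, logging intervals, etc.
            return len(self.loader_a) + len(self.loader_b)

One design note: "sequential" here means a full pass over dataset A followed by a full pass over dataset B within each epoch. If you need a batch from each set in the same optimization step (as consistency-training setups like UDA typically do), you would interleave or zip the two loaders instead of chaining them.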