Using mixup on RoBERTa

Hello everyone!

I tried to apply mixup, a data augmentation technique popular in computer vision, to an NLP task.

The algorithm I developed has two phases:

The first phase obtains a representation for each sentence in the batch by averaging the corresponding hidden states of the last layer. The fragment below shows the corresponding module.

import torch
import torch.nn as nn
from transformers import AutoConfig, AutoModel, AutoTokenizer


class LanguageModel(nn.Module):

  def __init__(self, pretrained_model_name, device="cuda:0", anonymized_tokens=False):
    super(LanguageModel, self).__init__()
    # Load tokenizer
    self.tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)
    # Load model
    self.config = AutoConfig.from_pretrained(pretrained_model_name)
    self.config.output_hidden_states = True
    self.model = AutoModel.from_pretrained(pretrained_model_name, config=self.config).to(device)

  def forward(self, input_ids, attention_mask):
    outputs = self.model(
        input_ids=input_ids,
        attention_mask=attention_mask,
    )
    # Sentence representation: mean of the last layer's hidden states over the sequence dimension
    activations = torch.mean(outputs[0], dim=1)
    return activations
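
For context, a hypothetical usage sketch of this module (the model name and example sentences below are just placeholders):

    language_model = LanguageModel("roberta-base")
    encoded = language_model.tokenizer(
        ["a first example sentence", "a second one"],
        padding=True, truncation=True, return_tensors="pt",
    ).to("cuda:0")
    sentence_embeddings = language_model(
        input_ids=encoded["input_ids"],
        attention_mask=encoded["attention_mask"],
    )  # shape: (batch_size, hidden_size), e.g. (2, 768) for roberta-base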

After that, the mixup examples are generated with the function proposed in the original mixup code, except that the inputs are the sentence representations computed in the previous step instead of the images used originally.
Once the mixup examples are generated, the second phase makes the predictions (the fragment below shows the corresponding module). Finally, the loss is computed in the same way as in the original work; a sketch of the mixup helpers is included after the classifier module.

class ClassifierLayer(nn.Module):

  def __init__(self, num_classes, dropout_rate=0.1, pretrained_size=768, device="cuda:0"):
    super(ClassifierLayer, self).__init__()
    self.layer = nn.Linear(pretrained_size, num_classes, bias=True).to(device)
    self.drop = nn.Dropout(dropout_rate)

  def forward(self, z):
    # Dropout followed by a single linear layer over the (possibly mixed) sentence representations
    activations = self.layer(self.drop(z))
    return activations
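
The mixup_data and mixup_criterion helpers called in the training loop below follow the original mixup reference code; this is a minimal sketch along those lines, assuming the usual signatures, not necessarily the exact version used:

    import numpy as np
    import torch

    def mixup_data(x, y, alpha=1.0, use_cuda=True):
        # Draw the mixing coefficient and a random permutation of the batch
        lam = np.random.beta(alpha, alpha) if alpha > 0 else 1.0
        index = torch.randperm(x.size(0))
        if use_cuda:
            index = index.cuda()
        # Convex combination of each example with a shuffled partner
        mixed_x = lam * x + (1 - lam) * x[index, :]
        return mixed_x, y, y[index], lam

    def mixup_criterion(criterion, pred, y_a, y_b, lam):
        # Interpolate the loss with the same lambda used to mix the inputs
        return lam * criterion(pred, y_a) + (1 - lam) * criterion(pred, y_b)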

The fragment below shows a summary of the proposed training loop; the full script used is here:

    for idx_epoch in range(0, args.num_train_epochs):
        language_model.train()
        classifier_layer.train()
        accs = 0; ps = 0; rs = 0; f1s = 0; lss = 0
        for (idx_batch, train_batch) in enumerate(train_dataloader):
            # 0: input_ids, 1: attention_mask, 2:token_type_ids, 3: labels
            batch_train = tuple(data_.to(device) for data_ in train_batch)
            labels_train = batch_train[-1]
            inputs = {
                'input_ids': batch_train[0],
                'attention_mask': batch_train[1],
            }
            optimizer.zero_grad()
            # 1st phase: contextual embeddings
            contextual_embeddings = language_model(
                input_ids=inputs['input_ids'],
                attention_mask=inputs['attention_mask'],
            )
            # 2nd phase: mixup
            inputs, targets_a, targets_b, lam = mixup_data(contextual_embeddings, labels_train, args.alpha_mixup, use_cuda)
            # (torch.autograd.Variable is a no-op in modern PyTorch; kept here from the original mixup code)
            inputs, targets_a, targets_b = map(Variable, (inputs, targets_a, targets_b))
            predictions = classifier_layer(inputs)
            loss = mixup_criterion(criterion, predictions, targets_a, targets_b, lam)
            
            # 2nd phase: standard
            # predictions = classifier_layer(contextual_embeddings)
            # loss = criterion(predictions, labels_train)
            
            lss += loss.item()  # accumulate the scalar value, not the computation graph
            loss.backward()
            optimizer.step()
            scheduler.step()

Experimenting with this approach, the results I obtained are very poor…

Have any of you worked on an approach similar to this one with good results?

Thanks.


Hi @franborjavalero!

This is really interesting. I remember @sgugger got a little bump using mixup after embeddings with ULMFiT. It would be really awesome to share this code, as the implementation for this is not trivial.

It wasn’t for transformers, but ULMFiT. I didn’t get the chance to try it on a transformer model.
Also, I was using the manifold mixup version, which applies the mixup at a random layer (not necessarily the embeddings), though this could also mess up the attention mechanism in transformers.
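
Roughly, the idea looks like this (just a hypothetical sketch with assumed names, not the actual code I used; "blocks" stands for a stack of layers that each map a hidden state to a hidden state, and it ignores attention masks, which is part of why this is tricky for transformers):

    import random
    import numpy as np
    import torch

    def manifold_mixup_forward(blocks, x, y, alpha=0.4):
        # Choose the depth at which to mix; k == 0 reduces to plain input mixup
        k = random.randrange(len(blocks) + 1)
        lam = np.random.beta(alpha, alpha) if alpha > 0 else 1.0
        index = torch.randperm(x.size(0), device=x.device)
        for i, block in enumerate(blocks):
            if i == k:
                x = lam * x + (1 - lam) * x[index]
            x = block(x)
        if k == len(blocks):
            x = lam * x + (1 - lam) * x[index]
        # The two label sets and lambda feed an interpolated loss, as in standard mixup
        return x, y, y[index], lam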


Thanks for sharing @sgugger.
Data augmentation for text classification really is a tough one. Is there anything you consider promising?

@franborjavalero you might want to check out this thread


Haven’t found anything that really stands out for now, so no magic trick on my side :wink:


Syntactic Data Augmentation Increases Robustness to Inference Heuristics, discussed in the other thread, seems interesting for NLI.


You might find our work on Cost-Sensitivity to be of interest. We found it to be a good alternative to data augmentation. [Paper here and Code here]
