Using mixup on RoBERTa

Hello everyone!

I tried to apply mixup, a data augmentation technique popular in computer vision, to an NLP task.

The algorithm I developed has two phases:

The first phase obtains a representation for each sentence in the batch by averaging the corresponding hidden states of the last layer. The fragment below shows the corresponding module.

import torch
import torch.nn as nn
from transformers import AutoConfig, AutoModel, AutoTokenizer


class LanguageModel(nn.Module):

  def __init__(self, pretrained_model_name, device="cuda:0", anonymized_tokens=False):
    super(LanguageModel, self).__init__()
    # Load tokenizer
    self.tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)
    # Load model
    self.config = AutoConfig.from_pretrained(pretrained_model_name)
    self.config.output_hidden_states = True
    self.model = AutoModel.from_pretrained(pretrained_model_name, config=self.config).to(device)

  def forward(self, input_ids, attention_mask):
    outputs = self.model(
        input_ids=input_ids,
        attention_mask=attention_mask,
    )
    # Sentence representation: mean of the last layer's hidden states over the sequence dimension
    activations = torch.mean(outputs[0], dim=1)
    return activations
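
For context, a hypothetical usage sketch of this module (the model name and example sentences below are just placeholders):

    language_model = LanguageModel("roberta-base")
    encoded = language_model.tokenizer(
        ["a first example sentence", "a second one"],
        padding=True, truncation=True, return_tensors="pt",
    ).to("cuda:0")
    sentence_embeddings = language_model(
        input_ids=encoded["input_ids"],
        attention_mask=encoded["attention_mask"],
    )  # shape: (batch_size, hidden_size), e.g. (2, 768) for roberta-base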

After that, the mixup examples are generated with the function proposed in the original mixup code, except that the inputs are the sentence representations computed in the previous step instead of the images used originally.
Once the mixup examples are generated, the second phase makes the predictions (the fragment below shows the corresponding module). Finally, the loss is computed in the same way as in the original work; a sketch of the mixup helpers is included after the classifier module.

class ClassifierLayer(nn.Module):

  def __init__(self, num_classes, dropout_rate=0.1, pretrained_size=768, device="cuda:0"):
    super(ClassifierLayer, self).__init__()
    self.layer = nn.Linear(pretrained_size, num_classes, bias=True).to(device)
    self.drop = nn.Dropout(dropout_rate)

  def forward(self, z):
    # Dropout followed by a single linear layer over the (possibly mixed) sentence representations
    activations = self.layer(self.drop(z))
    return activations
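
The mixup_data and mixup_criterion helpers called in the training loop below follow the original mixup reference code; this is a minimal sketch along those lines, assuming the usual signatures, not necessarily the exact version used:

    import numpy as np
    import torch

    def mixup_data(x, y, alpha=1.0, use_cuda=True):
        # Draw the mixing coefficient and a random permutation of the batch
        lam = np.random.beta(alpha, alpha) if alpha > 0 else 1.0
        index = torch.randperm(x.size(0))
        if use_cuda:
            index = index.cuda()
        # Convex combination of each example with a shuffled partner
        mixed_x = lam * x + (1 - lam) * x[index, :]
        return mixed_x, y, y[index], lam

    def mixup_criterion(criterion, pred, y_a, y_b, lam):
        # Interpolate the loss with the same lambda used to mix the inputs
        return lam * criterion(pred, y_a) + (1 - lam) * criterion(pred, y_b)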

The fragment below shows a summary of the proposed training loop; the full script used is here:

    for idx_epoch in range(0, args.num_train_epochs):
        language_model.train()
        classifier_layer.train()
        accs = 0; ps = 0; rs = 0; f1s = 0; lss = 0
        for (idx_batch, train_batch) in enumerate(train_dataloader):
            # 0: input_ids, 1: attention_mask, 2:token_type_ids, 3: labels
            batch_train = tuple(data_.to(device) for data_ in train_batch)
            labels_train = batch_train[-1]
            inputs = {
                'input_ids': batch_train[0],
                'attention_mask': batch_train[1],
            }
            optimizer.zero_grad()
            # 1st phase: contextual embeddings
            contextual_embeddings = language_model(
                input_ids=inputs['input_ids'],
                attention_mask=inputs['attention_mask'],
            )
            # 2nd phase: mixup
            inputs, targets_a, targets_b, lam = mixup_data(contextual_embeddings, labels_train, args.alpha_mixup, use_cuda)
            # (torch.autograd.Variable is a no-op in modern PyTorch; kept here from the original mixup code)
            inputs, targets_a, targets_b = map(Variable, (inputs, targets_a, targets_b))
            predictions = classifier_layer(inputs)
            loss = mixup_criterion(criterion, predictions, targets_a, targets_b, lam)
            
            # 2nd phase: standard
            # predictions = classifier_layer(contextual_embeddings)
            # loss = criterion(predictions, labels_train)
            
            lss += loss.item()  # accumulate the scalar value, not the computation graph
            loss.backward()
            optimizer.step()
            scheduler.step()

Experimenting with this approach, the results I obtained are very poor…

Have any of you worked on an approach similar to this one with good results?

Thanks.


Hi @franborjavalero!

This is really interesting. I remember @sgugger got a little bump using mixup after embeddings with ULMFiT. It would be really awesome to share this code, as the implementation for this is not trivial.

It wasn’t for transformers, but ULMFiT. I didn’t get the chance to try it on a transformer model.
Also, I was using the manifold mixup version, which applies the mixup at a random layer (not necessarily the embeddings), though this could also mess up the attention mechanism in transformers.
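
Roughly, the idea looks like this (just a hypothetical sketch with assumed names, not the actual code I used; "blocks" stands for a stack of layers that each map a hidden state to a hidden state, and it ignores attention masks, which is part of why this is tricky for transformers):

    import random
    import numpy as np
    import torch

    def manifold_mixup_forward(blocks, x, y, alpha=0.4):
        # Choose the depth at which to mix; k == 0 reduces to plain input mixup
        k = random.randrange(len(blocks) + 1)
        lam = np.random.beta(alpha, alpha) if alpha > 0 else 1.0
        index = torch.randperm(x.size(0), device=x.device)
        for i, block in enumerate(blocks):
            if i == k:
                x = lam * x + (1 - lam) * x[index]
            x = block(x)
        if k == len(blocks):
            x = lam * x + (1 - lam) * x[index]
        # The two label sets and lambda feed an interpolated loss, as in standard mixup
        return x, y, y[index], lam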


Thanks for sharing @sgugger.
Data augmentation for text classification really is a tough one. Is there anything you consider promising?

@franborjavalero you might want to check out this thread


Haven’t found anything that really stands out for now, so no magic trick on my side :wink:


Syntactic Data Augmentation Increases Robustness to Inference Heuristics, discussed in the other thread, seems interesting for NLI.


You might find our work on Cost-Sensitivity to be of interest. We found it to be a good alternative to data augmentation. [Paper here and Code here]
