Hello everyone!
I tried to apply mixup, a data augmentation technique popular in computer vision, to NLP.
The algorithm works in two phases:
The first phase computes a representation for each sentence in the batch by averaging the hidden states of the last layer. The fragment below shows the corresponding module.
import torch
import torch.nn as nn
from transformers import AutoConfig, AutoModel, AutoTokenizer

class LanguageModel(nn.Module):
    def __init__(self, pretrained_model_name, device="cuda:0", anonymized_tokens=False):
        super(LanguageModel, self).__init__()
        # Load tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)
        # Load model
        self.config = AutoConfig.from_pretrained(pretrained_model_name)
        self.config.output_hidden_states = True
        self.model = AutoModel.from_pretrained(pretrained_model_name, config=self.config).to(device)

    def forward(self, input_ids, attention_mask):
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        # Sentence representation: mean of the last hidden state over the token dimension
        # (note: this also averages over padding positions)
        activations = torch.mean(outputs[0], dim=1)
        return activations
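For clarity, here is a minimal usage sketch of this module; the model name and sentences are placeholders, not the ones from my experiments:

# Hypothetical usage sketch
language_model = LanguageModel("bert-base-uncased", device="cuda:0")
encoded = language_model.tokenizer(
    ["a first sentence", "a second, longer sentence"],
    padding=True, truncation=True, return_tensors="pt",
)
embeddings = language_model(
    input_ids=encoded["input_ids"].to("cuda:0"),
    attention_mask=encoded["attention_mask"].to("cuda:0"),
)
print(embeddings.shape)  # torch.Size([2, 768]) for a BERT-base model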
After that, it generates the mixup examples using the function proposed in the original code, but with the sentence representations computed in the previous step as input, instead of images as originally.
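For reference, this is the mixup_data function as I understand it from the original mixup-cifar10 code (assuming that is the version being used), applied here to sentence embeddings instead of image tensors:

import numpy as np
import torch

def mixup_data(x, y, alpha=1.0, use_cuda=True):
    """Returns mixed inputs, pairs of targets, and lambda."""
    # Sample the interpolation coefficient from a Beta(alpha, alpha) distribution
    lam = np.random.beta(alpha, alpha) if alpha > 0 else 1.0
    batch_size = x.size(0)
    # Shuffle the batch to pick a mixing partner for each example
    index = torch.randperm(batch_size).cuda() if use_cuda else torch.randperm(batch_size)
    # Convex combination of each embedding with its shuffled partner
    mixed_x = lam * x + (1 - lam) * x[index, :]
    y_a, y_b = y, y[index]
    return mixed_x, y_a, y_b, lam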
Once the mixup examples are generated, the second phase makes the predictions (the fragment below shows the corresponding module). Finally, the loss is computed in the same way as in the original work (a sketch of that criterion follows the fragment).
class ClassifierLayer(nn.Module):
    def __init__(self, num_classes, dropout_rate=0.1, pretrained_size=768, device="cuda:0"):
        super(ClassifierLayer, self).__init__()
        self.layer = nn.Linear(pretrained_size, num_classes, bias=True).to(device)
        self.drop = nn.Dropout(dropout_rate)

    def forward(self, z):
        # Dropout followed by a single linear projection to the class logits
        activations = self.layer(self.drop(z))
        return activations
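The loss is the lambda-weighted combination of the criterion evaluated against both target sets, again following the original mixup-cifar10 code (assuming that version):

def mixup_criterion(criterion, pred, y_a, y_b, lam):
    # Interpolate the two losses with the same lambda used to mix the inputs
    return lam * criterion(pred, y_a) + (1 - lam) * criterion(pred, y_b)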
The fragment below shows a summary of the proposed training loop; the full script used is here:
for idx_epoch in range(0, args.num_train_epochs):
    language_model.train()
    classifier_layer.train()
    accs = 0; ps = 0; rs = 0; f1s = 0; lss = 0
    for (idx_batch, train_batch) in enumerate(train_dataloader):
        # 0: input_ids, 1: attention_mask, 2: token_type_ids, 3: labels
        batch_train = tuple(data_.to(device) for data_ in train_batch)
        labels_train = batch_train[-1]
        inputs = {
            'input_ids': batch_train[0],
            'attention_mask': batch_train[1],
        }
        optimizer.zero_grad()
        # 1st phase: contextual embeddings
        contextual_embeddings = language_model(
            input_ids=inputs['input_ids'],
            attention_mask=inputs['attention_mask'],
        )
        # 2nd phase: mixup
        inputs, targets_a, targets_b, lam = mixup_data(contextual_embeddings, labels_train, args.alpha_mixup, use_cuda)
        inputs, targets_a, targets_b = map(Variable, (inputs, targets_a, targets_b))  # Variable is a no-op in modern PyTorch
        predictions = classifier_layer(inputs)
        loss = mixup_criterion(criterion, predictions, targets_a, targets_b, lam)
        # 2nd phase alternative: standard training (no mixup)
        # predictions = classifier_layer(contextual_embeddings)
        # loss = criterion(predictions, labels_train)
        lss += loss.item()  # .item() detaches the value so the computation graph is not kept alive
        loss.backward()
        optimizer.step()
        scheduler.step()
Experimenting with this approach, the results I obtain are very poor…
Has any of you worked on a similar approach with good results?
Thanks.