Padding strategy for classification

Isabella · July 15, 2020, 4:03pm

Hello everyone,
I am working on multiclass text classification, currently using XLM-Roberta as a classifier. I have a question about padding strategies.

My first intuition was to tokenize my training and validation sets separately (as if they were two distinct super-batches) using padding=True. As a result, all training examples were padded to length l1 (the length of the longest sequence in the training set), and all validation examples were padded to a different length l2.

An alternative approach (and the one that seems to be used in the GlueDataset and related methods) is to use padding="max_length", and thus have all examples padded to the same provided length (possibly 512, which is the maximum sequence length allowed for this model).
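To make the two options concrete, here is a minimal pure-Python sketch of what each strategy produces. The helper `pad_batch`, the toy token ids, and the target length of 8 are hypothetical and only for illustration; the pad id of 1 is assumed to match XLM-R's `<pad>` token. A real tokenizer does this work for you, and it only truncates when asked to, whereas this helper always truncates for simplicity.

```python
from typing import Dict, List, Optional

PAD_ID = 1  # assumption: pad id used for illustration


def pad_batch(
    sequences: List[List[int]],
    max_length: Optional[int] = None,
    pad_id: int = PAD_ID,
) -> Dict[str, List[List[int]]]:
    """Pad token-id lists to a common length and build the matching attention mask.

    max_length=None pads to the longest sequence in `sequences`;
    an integer pads (and, in this sketch, truncates) to that fixed length.
    """
    target = max_length if max_length is not None else max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:target]
        n_pad = target - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy "training" and "validation" sets with made-up token ids
train = [[0, 11, 12, 13, 2], [0, 21, 2]]
valid = [[0, 31, 32, 2]]

# Strategy 1: each set padded to its own longest sequence (l1 = 5, l2 = 4)
longest_train = pad_batch(train)
longest_valid = pad_batch(valid)

# Strategy 2: every example padded to the same fixed length
fixed_train = pad_batch(train, max_length=8)
fixed_valid = pad_batch(valid, max_length=8)
```

In both cases the attention mask marks exactly the same real tokens; only the number of trailing zeros changes.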

Would you mind sharing your thoughts on which strategy might work best and make more sense from a “theoretical” point of view?

Thank you very much!

chrisdoyleIE · July 16, 2020, 2:46pm

Hi Isabella,

My understanding is that so long as you have your padding mask correctly implemented, the model will not “pay attention” to the pad tokens and so the predictions should be consistent across models (regardless of padding length).

If you do not use a padding mask then the predictions could differ, because the attention weights for the pad tokens will have some effect on your predicted outcome. By having an effect, I mean that the pad tokens contribute to the attention scores, and therefore to the loss.
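A small pure-Python sketch of this point, under stated assumptions: a single query, a single attention head, made-up 2-dimensional embeddings, and an additive mask of -10000.0 on padded positions. The function `attend` and all the values are hypothetical. With the mask, the output for a real token is the same whether the sequence carries 1 or 5 pad tokens; without it, the output shifts.

```python
import math


def attend(query, keys, values, mask=None):
    """Single-query scaled dot-product attention.

    `mask` holds 1 for real tokens and 0 for padding; padded positions
    get a large negative number added to their score before the softmax.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    if mask is not None:
        scores = [s + (1 - m) * -10000.0 for s, m in zip(scores, mask)]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


real = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up embeddings of 3 real tokens
pad = [0.5, -0.5]                            # made-up embedding of the pad token


def padded(n_pad):
    """Return the token embeddings followed by n_pad pad embeddings, plus the mask."""
    return real + [pad] * n_pad, [1] * len(real) + [0] * n_pad


query = real[0]
tokens_a, mask_a = padded(1)
tokens_b, mask_b = padded(5)

masked_a = attend(query, tokens_a, tokens_a, mask_a)
masked_b = attend(query, tokens_b, tokens_b, mask_b)
unmasked_a = attend(query, tokens_a, tokens_a)
unmasked_b = attend(query, tokens_b, tokens_b)
```

Here `masked_a` and `masked_b` agree, while `unmasked_a` and `unmasked_b` do not, because the unmasked pad tokens take a share of the attention weights that grows with the amount of padding.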

In terms of time complexity, my understanding is somewhat murkier. I suspect that the first method you suggest is faster, but it depends on the implementation. If we calculate all attention scores and then set those relating to padding to zero, the max_length version could be much slower. However, I think it is reasonable to assume that the implementation performs a check first, i.e. “should I calculate attention here?”, which would be only slightly slower than the l1, l2 padding. This point is open to correction!
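For the first scenario above (every score is computed and the mask is applied afterwards, which is what the BERT snippet quoted later in this thread does), a rough count of the work involved can be sketched as follows. The batch size of 32, the 12 heads, and the longest-sequence length of 64 are made-up numbers for illustration, and the function name is hypothetical.

```python
def attention_score_entries(batch_size: int, num_heads: int, seq_len: int) -> int:
    """Number of query-key scores one dense self-attention layer computes."""
    return batch_size * num_heads * seq_len * seq_len


# Assumed setting: the longest real sequence has 64 tokens
longest = attention_score_entries(batch_size=32, num_heads=12, seq_len=64)
fixed = attention_score_entries(batch_size=32, num_heads=12, seq_len=512)

# The score matrix grows with the square of the padded length
ratio = fixed / longest
```

Under these assumptions, padding to 512 instead of 64 makes the attention score matrix 64 times larger, since the length grows by a factor of 8 and the cost is quadratic in it.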

I could be incorrect in my approach, but I prefer to pad to max_length for the sheer convenience of it. In the l1, l2 approach, for example, methods such as cross-validation require the length calculation to be repeated for each split.

leoapolonio · July 16, 2020, 3:25pm

The way the model logic is implemented, it doesn’t matter.

Specifically, this is because of the function below, which tells the model to look only at real tokens and not at padding tokens:

https://github.com/huggingface/transformers/blob/edfd82f5ff179f7600d8f2eea204e21bd07d99e4/src/transformers/modeling_utils.py#L192

    else:
        raise ValueError(
            "{} not recognized. `dtype` should be set to either `torch.float32` or `torch.float16`".format(
                self.dtype
            )
        )

    return encoder_extended_attention_mask

def get_extended_attention_mask(self, attention_mask: Tensor, input_shape: Tuple, device: device) -> Tensor:
    """Makes broadcastable attention mask and causal mask so that future and masked tokens are ignored.

    Arguments:
        attention_mask: torch.Tensor with 1 indicating tokens to ATTEND to
        input_shape: tuple, shape of input_ids
        device: torch.Device, usually self.device

    Returns:
        torch.Tensor with dtype of attention_mask.dtype
    """
    # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]

The output of that function is then used here, where it tells the model which positions to attend to:

https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_bert.py#L263

query_layer = self.transpose_for_scores(mixed_query_layer)
key_layer = self.transpose_for_scores(mixed_key_layer)
value_layer = self.transpose_for_scores(mixed_value_layer)

# Take the dot product between "query" and "key" to get the raw attention scores.
attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
attention_scores = attention_scores / math.sqrt(self.attention_head_size)
if attention_mask is not None:
    # Apply the attention mask is (precomputed for all layers in BertModel forward() function)
    attention_scores = attention_scores + attention_mask

# Normalize the attention scores to probabilities.
attention_probs = nn.Softmax(dim=-1)(attention_scores)

# This is actually dropping out entire tokens to attend to, which might
# seem a bit unusual, but is taken from the original Transformer paper.
attention_probs = self.dropout(attention_probs)

# Mask heads if we want to
if head_mask is not None:

If it’s unclear, put a breakpoint on those lines and run this unit test: https://github.com/huggingface/transformers/blob/master/tests/test_modeling_bert.py#L521
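The masking path quoted above can be sketched in a few lines of plain Python. This is a simplified illustration, not the library code: the helper names and the raw scores are made up, and the reshaping that lets the mask broadcast over heads and query positions is left out. It shows the 1/0 mask being turned into an additive mask and the softmax then assigning zero probability to the padded positions.

```python
import math


def to_additive_mask(attention_mask):
    """Map a [batch, seq] mask of 1/0 to additive form: 0 to keep, -10000.0 to suppress."""
    return [[(1.0 - m) * -10000.0 for m in row] for row in attention_mask]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up raw attention scores of one query over 5 key positions;
# the last two positions are padding.
raw_scores = [0.3, 1.2, -0.5, 0.9, 0.9]
additive = to_additive_mask([[1, 1, 1, 0, 0]])[0]

# Same operation as `attention_scores + attention_mask` in the snippet above
probs = softmax([s + a for s, a in zip(raw_scores, additive)])
```

After the softmax, the two padded positions receive a probability of exactly 0.0 and the three real positions share the full probability mass.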

Isabella · July 20, 2020, 12:59pm

Thank you all for your answers. While I agree that the specific padding strategy should not affect the results, in my implementation (actually, implementations, as I am testing alternative code) the training results do differ. The differences are not huge, but I can still see them (even though I am keeping everything else fixed, seeds included).

So I was wondering whether my implementations are wrong, or whether inputs padded to different lengths can somehow have a small numerical effect on the gradients…

Thanks again!
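Two mechanisms that could plausibly produce such differences, sketched in plain Python; neither is confirmed as the cause here. First, during training a dropout mask is drawn for the whole padded tensor, so with a fixed seed a different padded length can shift which random draw lands on which real token. Second, floating-point addition is not associative, so a different tensor shape can change the order of a reduction and with it the last bits of the result. The function `dropout_noise` is hypothetical, and a real framework lays out its random numbers differently; this only illustrates the idea.

```python
import random


def dropout_noise(batch_size: int, seq_len: int, seed: int):
    """Draw one uniform number per position, row by row, from a single seeded stream.

    A dropout mask would keep a position when its number is >= the dropout rate.
    """
    rng = random.Random(seed)
    return [[rng.random() for _ in range(seq_len)] for _ in range(batch_size)]


# Same seed, same batch, two padded lengths; assume the first 6 positions are real tokens
padded_short = dropout_noise(batch_size=4, seq_len=6, seed=0)
padded_long = dropout_noise(batch_size=4, seq_len=16, seed=0)

# The first row starts at the same point in the random stream...
first_row_matches = padded_short[0] == padded_long[0][:6]
# ...but later rows start at different offsets, so real tokens see different noise
second_row_matches = padded_short[1] == padded_long[1][:6]

# Floating-point addition depends on grouping
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
grouping_matters = left != right
```

In this sketch the first row matches and the second does not, which is enough to make two training runs diverge slightly even though the seed and the data are identical.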
