Why does padding = 'max_length' cause much slower model inference?

I trained a bert-base-uncased AutoModelForSequenceClassification model and found that inference is at least 2x faster if I comment out padding='max_length' in the encoding step. My understanding is that BERT expects a fixed length of 512 tokens, so doesn't that mean the input must be padded to 512? Please help me understand.

sequence = tokenizer.encode_plus(question,
                                 passage,
                                 max_length=256,
                                 padding='max_length',
                                 truncation="longest_first",
                                 return_tensors="pt")['input_ids'].to(device)

In the same boat. We added max_length=512 and our training time went from 6 minutes to 1 hour on a 4090.
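
In case it helps, here is a minimal sketch of dynamic per-batch padding with DataCollatorWithPadding, which pads each batch only to its own longest sequence instead of always padding to 512 (model_name, train_dataset, and the "text" column are placeholders, not from the thread):

from transformers import AutoTokenizer, DataCollatorWithPadding
from torch.utils.data import DataLoader

model_name = "bert-base-uncased"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize without fixed-size padding; only truncate over-long examples.
def tokenize_fn(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# DataCollatorWithPadding pads each batch to that batch's longest sequence,
# so batches of short examples stay short instead of being padded to 512.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# train_dataset is assumed to be a datasets.Dataset with a "text" column:
# tokenized = train_dataset.map(tokenize_fn, batched=True, remove_columns=["text"])
# loader = DataLoader(tokenized, batch_size=16, collate_fn=data_collator)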