Fine-Tuning DeBERTa Produces Non-Results

Hi there,

I am currently working on a binary text classification problem. My baseline is a RoBERTa model tuned over the following search space:

'parameters': {
    'learning_rate': {
        'values': [5e-4, 1e-4, 5e-5, 3e-5]},
    'per_device_train_batch_size': {
        'values': [8, 16, 32, 64]},
    'num_train_epochs': {
        'values': [3, 4, 5]},
}
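
(For context, the surrounding sweep setup looks roughly like the sketch below, assuming a W&B-style sweep; the search method, metric name, and project name are simplified placeholders, not my actual values.)

import wandb

sweep_config = {
    'method': 'grid',  # placeholder: exhaustive search over the listed values
    'metric': {'name': 'eval/f1', 'goal': 'maximize'},  # placeholder metric name
    'parameters': {
        'learning_rate': {'values': [5e-4, 1e-4, 5e-5, 3e-5]},
        'per_device_train_batch_size': {'values': [8, 16, 32, 64]},
        'num_train_epochs': {'values': [3, 4, 5]},
    },
}

sweep_id = wandb.sweep(sweep_config, project='binary-text-clf')  # placeholder project name
# wandb.agent(sweep_id, function=train_fn, count=48)  # train_fn = your training entry point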

Next, I’d like to test whether I can improve over this baseline using microsoft/deberta-v3-small. The only two lines of code I changed to make it run are:

tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-small', use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained(
    'microsoft/deberta-v3-small', num_labels=2
)

Unfortunately, the DeBERTa model produces non-results for every one of the hyperparameter candidates (i.e., a ROC-AUC of 0.5 and an F1 of 0), and I just can’t figure out why. When I switch back to the roberta-base checkpoint, everything runs smoothly and the model learns as expected.

Without reproducing my entire modeling script here: do any differences between the two models come to mind that would require changes to my codebase beyond the two lines above? Grateful for any suggestions!

DeBERTa can be really sensitive to learning rate, so I’d recommend trying lower learning rates.
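
For example, something along these lines is usually a safer starting point for deberta-v3-small (the exact values are just common defaults I reach for, nothing official); in your sweep, I’d expect 5e-4 and 1e-4 in particular to be the values that break training:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='deberta-v3-small-binary-clf',  # placeholder output directory
    learning_rate=2e-5,                         # well below the 5e-4/1e-4 end of your grid
    per_device_train_batch_size=16,
    num_train_epochs=4,
    warmup_ratio=0.1,                           # a short warmup often helps stability
    weight_decay=0.01,
    evaluation_strategy='epoch',
)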

Are you able to share your training script?


@nbroad That does indeed appear to be the solution. With a learning rate of 3e-5, the model finally produces reasonable outputs! Thanks for the suggestion.

Two quick follow ups:

  • When loading the DeBERTa fast tokenizer, transformers throws the following warning:
    /usr/local/lib/python3.7/dist-packages/transformers/convert_slow_tokenizer.py:447: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.
      "The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option"
    
    Is this something I should take into consideration? Put differently: Is there a way to work around it? I had to install sentencepiece to use the tokenizer in the first place.
  • When running my training loop I also receive the following warnings:
    /usr/local/lib/python3.7/dist-packages/transformers/models/deberta_v2/modeling_deberta_v2.py:746: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
      scale, dtype=query_layer.dtype
    /usr/local/lib/python3.7/dist-packages/transformers/models/deberta_v2/modeling_deberta_v2.py:829: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
      score += c2p_att / torch.tensor(scale, dtype=c2p_att.dtype)
    /usr/local/lib/python3.7/dist-packages/transformers/models/deberta_v2/modeling_deberta_v2.py:852: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
      score += p2c_att / torch.tensor(scale, dtype=p2c_att.dtype)
    
    Any idea how to circumvent these?

Neither of those is a problem; they’re just warnings.
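
That said, if you’d rather not see them at all: for the first one you can fall back to the slow sentencepiece tokenizer, and both can simply be filtered. A quick sketch (the message patterns below are just taken from the warnings you pasted):

import warnings
from transformers import AutoTokenizer

# Option 1: sidestep the byte-fallback warning by loading the slow sentencepiece tokenizer
tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-small', use_fast=False)

# Option 2: keep the fast tokenizer and silence the warnings shown above
warnings.filterwarnings('ignore', message='.*byte fallback.*')
warnings.filterwarnings('ignore', message='.*To copy construct from a tensor.*')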