How to make a model predict on only some tokens

Consider the following simplified example:

input = tensor([1, 2, 3, 1, 2, 5, 1, 4, 7])

Now, consider that 1 is equivalent to the [MASK] token. I want to run a RobertaForMaskedLM model to predict alternatives to only the masked tokens, i.e. the tokens whose entry is 1. The standard call is model(**inputs), where inputs is of the form {'input_ids': torch.tensor(...), 'attention_mask': torch.tensor(...), 'labels': torch.tensor(...)}.
How can I go about achieving this?


If you train with a MaskedLM model, set labels only at the [Mask] token positions.
From the RoBERTa docs:

  • labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should be in [-100, 0, ..., config.vocab_size] (see input_ids docstring) Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size]
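To make the effect of -100 concrete, here is a minimal pure-Python sketch of a mean cross-entropy that skips positions labeled -100. It is only an illustration of the behaviour the docs describe, not the library's implementation; the function name masked_lm_loss and the toy 3-id vocabulary are assumptions made for this example.

```python
import math

IGNORE_INDEX = -100  # label value that the loss skips


def masked_lm_loss(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over positions whose label is not ignore_index.

    logits: list of per-position score lists (one score per vocab id).
    labels: list of target vocab ids, or ignore_index to skip a position.
    """
    losses = []
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # this position contributes nothing to the loss
        # log-sum-exp, shifted by the max score for numerical stability
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_z - scores[label])
    # Returning 0.0 when nothing is labeled is a choice made for this sketch.
    return sum(losses) / len(losses) if losses else 0.0


# Toy vocabulary of 3 ids, 3 positions; only the middle position is labeled.
logits = [[5.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 9.0]]
labels = [IGNORE_INDEX, 1, IGNORE_INDEX]
loss = masked_lm_loss(logits, labels)
```

Changing the scores at the first or last position leaves loss unchanged, because those positions carry the label -100.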

You need to convert the input text into input encodings.
Here is an example.

Start with some original input (the vocab ids here are just random values):

  • [a,b,c,d,e]
  • [12,22,55,465,44]

Then mask some of the tokens (the mask token id is 1 here; you can set it in the tokenizer’s special token mapping):

  • [a,[Mask],c,d,[Mask]]
  • [12,1,55,465,1]

Then build the labels so that each [Mask] position predicts the original token and every other position is ignored (label -100, written as [UNK] here):

  • [[UNK],b,[UNK],[UNK],e]
  • [-100, 22, -100, -100, 44]
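The labeling step can be sketched in a few lines of plain Python. The helper name build_labels is hypothetical; the ids are the ones from the example, with mask token id 1 as assumed in this thread.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def build_labels(original_ids, masked_ids, mask_token_id, ignore_index=IGNORE_INDEX):
    """Keep the original id where the input was masked, ignore_index elsewhere."""
    if len(original_ids) != len(masked_ids):
        raise ValueError("original_ids and masked_ids must have the same length")
    return [
        orig if masked == mask_token_id else ignore_index
        for orig, masked in zip(original_ids, masked_ids)
    ]


original_ids = [12, 22, 55, 465, 44]  # [a, b, c, d, e]
masked_ids = [12, 1, 55, 465, 1]      # [a, [Mask], c, d, [Mask]]
labels = build_labels(original_ids, masked_ids, mask_token_id=1)
print(labels)  # [-100, 22, -100, -100, 44]
```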

The attention mask is used to avoid performing attention on padding token indices.
If your input is padded to max_size = 6, the input and attention_mask will look like this (pad_token id = 3):

  • [a,[Mask],c,d,[Mask],[PAD]]
  • [12,1,55,465,1,3]

attention_mask takes values in [1,0]: 1 = not a pad token, 0 = pad token.

  • [1,1,1,1,1,0]
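A minimal sketch of the padding step, assuming right-padding and the pad token id 3 used in this example. The helper name pad_with_attention_mask is hypothetical.

```python
def pad_with_attention_mask(input_ids, max_length, pad_token_id):
    """Right-pad input_ids to max_length and build the matching attention mask."""
    if len(input_ids) > max_length:
        raise ValueError("input is longer than max_length")
    n_pad = max_length - len(input_ids)
    padded_ids = list(input_ids) + [pad_token_id] * n_pad
    # 1 marks a real token, 0 marks a pad token
    attention_mask = [1] * len(input_ids) + [0] * n_pad
    return padded_ids, attention_mask


padded_ids, attention_mask = pad_with_attention_mask(
    [12, 1, 55, 465, 1], max_length=6, pad_token_id=3
)
print(padded_ids)      # [12, 1, 55, 465, 1, 3]
print(attention_mask)  # [1, 1, 1, 1, 1, 0]
```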

So you can build the encoding like this:

  1. input_ids
    [12, 1, 55, 465, 1, 3] #([a, [Mask], c, d, [Mask], [PAD]])
  2. attention_mask
    [1, 1, 1, 1, 1, 0]
  3. labels
    [-100, 22, -100, -100, 44, -100] #([[UNK], b, [UNK], [UNK], e, [UNK]])
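Putting the three pieces together, a rough sketch of the whole encoding might look like the following. The function name encode_example and its parameters are hypothetical, and the mask and pad ids (1 and 3) are the toy values from this example rather than the ids of a real tokenizer.

```python
def encode_example(original_ids, mask_positions, max_length,
                   mask_token_id, pad_token_id, ignore_index=-100):
    """Build input_ids, attention_mask and labels for one masked example."""
    n_pad = max_length - len(original_ids)
    if n_pad < 0:
        raise ValueError("input is longer than max_length")
    positions = set(mask_positions)
    # Replace the chosen positions with the mask token
    input_ids = [mask_token_id if i in positions else tok
                 for i, tok in enumerate(original_ids)]
    # Only masked positions keep their original id as the label
    labels = [tok if i in positions else ignore_index
              for i, tok in enumerate(original_ids)]
    return {
        "input_ids": input_ids + [pad_token_id] * n_pad,
        "attention_mask": [1] * len(original_ids) + [0] * n_pad,
        "labels": labels + [ignore_index] * n_pad,  # padding is ignored too
    }


encoding = encode_example([12, 22, 55, 465, 44], mask_positions=[1, 4],
                          max_length=6, mask_token_id=1, pad_token_id=3)
print(encoding["input_ids"])       # [12, 1, 55, 465, 1, 3]
print(encoding["attention_mask"])  # [1, 1, 1, 1, 1, 0]
print(encoding["labels"])          # [-100, 22, -100, -100, 44, -100]
```

Before calling model(**inputs), each list would still need to be wrapped in a tensor with a batch dimension, for example torch.tensor([values]), to match the (batch_size, sequence_length) shape from the docs.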

Also, don’t forget the [CLS] and [SEP] tokens (RoBERTa’s tokenizer writes them as <s> and </s>).
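If you build the ids by hand, the special tokens need labels as well, and -100 is the natural choice so they are never predicted. Below is a small sketch; the helper name add_special_tokens_with_labels and the ids 0 and 2 are assumptions for illustration, so take the real ids from your tokenizer. In practice the tokenizer normally adds these tokens for you when it encodes text.

```python
def add_special_tokens_with_labels(input_ids, labels, cls_token_id, sep_token_id,
                                   ignore_index=-100):
    """Wrap a sequence in [CLS] ... [SEP] and ignore both positions in the loss."""
    new_ids = [cls_token_id] + list(input_ids) + [sep_token_id]
    new_labels = [ignore_index] + list(labels) + [ignore_index]
    return new_ids, new_labels


# Apply this before padding, so that [PAD] tokens come after [SEP].
ids, labels = add_special_tokens_with_labels(
    [12, 1, 55, 465, 1],           # [a, [Mask], c, d, [Mask]]
    [-100, 22, -100, -100, 44],    # labels for the masked positions only
    cls_token_id=0,                # hypothetical id for [CLS]
    sep_token_id=2,                # hypothetical id for [SEP]
)
print(ids)     # [0, 12, 1, 55, 465, 1, 2]
print(labels)  # [-100, -100, 22, -100, -100, 44, -100]
```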

The official Hugging Face docs list RobertaForMaskedLM’s input parameters here:
Roberta_MaskedLM input parameters

Hope this helps.

