Hi, I would like to configure RobertaTokenizer
so that it outputs token_type_ids,
which it doesn't do by default. Is there a way to do that?
I have changed the model configuration and updated its type_vocab_size
to 2, like so:
import torch.nn as nn
from transformers import RobertaModel

model = RobertaModel.from_pretrained('roberta-base')
# Update config to finetune token type embeddings
model.config.type_vocab_size = 2
# Create a new embedding layer with 2 possible segment IDs instead of 1
model.embeddings.token_type_embeddings = nn.Embedding(2, model.config.hidden_size)
# Initialize it
model.embeddings.token_type_embeddings.weight.data.normal_(mean=0.0, std=model.config.initializer_range)
I want to pass token_type_ids to the model instance like so:
model(input_ids=token_ids, attention_mask=attn_masks, token_type_ids=token_type_ids)
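In case it helps frame the question, here is a rough sketch of the fallback I have in mind: building the segment IDs by hand from the encoded input. It assumes RoBERTa's pair layout `<s> A </s></s> B </s>`, and that the separator and padding IDs are 2 and 1 (as in the roberta-base vocabulary; in practice they should be read from tokenizer.sep_token_id and tokenizer.pad_token_id). The helper name build_token_type_ids is hypothetical, not part of the library.

```python
from typing import List


def build_token_type_ids(
    input_ids: List[int],
    sep_token_id: int = 2,  # assumption: </s> ID in roberta-base
    pad_token_id: int = 1,  # assumption: <pad> ID in roberta-base
) -> List[int]:
    """Hypothetical helper: 0 for the first segment (including its
    separators), 1 for the second segment, 0 for padding."""
    token_type_ids = []
    segment = 0
    seen_sep = False
    for tok in input_ids:
        if tok == pad_token_id:
            # Padding positions get segment 0.
            token_type_ids.append(0)
            continue
        if tok == sep_token_id:
            seen_sep = True
        elif seen_sep and segment == 0:
            # First real token after the separator run starts segment 1.
            segment = 1
        token_type_ids.append(segment)
    return token_type_ids


# Made-up IDs laid out as: <s> A A </s> </s> B B </s> <pad> <pad>
example_ids = [0, 100, 101, 2, 2, 200, 201, 2, 1, 1]
example_types = build_token_type_ids(example_ids)
print(example_types)
```

The result would then be wrapped in a tensor and passed as token_type_ids, but I would prefer the tokenizer to produce it directly if that is possible.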