Hi,
I’m trying to create a DistilBertModel for sequence classification with max_position_embeddings=1024 (otherwise I would have used DistilBertForSequenceClassification, which defaults to max_position_embeddings=512).
I define the model in the following way:
from transformers import DistilBertConfig, DistilBertModel
configuration = DistilBertConfig(max_position_embeddings=1024)
model = DistilBertModel(configuration)
When I forward an input through the model in the following way:
output = model(ids, attention_mask=mask, return_dict=False)[0]
where ids.shape = (batch_size, 1024) and mask.shape = (batch_size, 1024), the shape of the output is (batch_size, 1024, 768).
My question is: what is the best practice for converting this output into a probability vector over the labels, so that the modified output has shape (batch_size, num_labels)?
I thought of a few options including flattening the current output + an additional FC layer, but I’m not sure this is the best practice.
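To make the options concrete, here is a rough sketch of two heads I am comparing. It is only an illustration under my own assumptions: the class names ClsPoolingHead and FlattenHead, num_labels=3, and the dropout rate are made up, and a random tensor stands in for the real model output. The first head keeps only the hidden state of the first token, which as far as I can tell is what DistilBertForSequenceClassification does internally; the second is the flatten + FC idea.

```python
import torch
import torch.nn as nn


class ClsPoolingHead(nn.Module):
    """Hypothetical head: keep the first token's hidden state, then a small MLP."""

    def __init__(self, hidden_size=768, num_labels=3, dropout=0.1):
        super().__init__()
        self.pre_classifier = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states):
        pooled = hidden_states[:, 0]  # (batch_size, hidden_size)
        pooled = self.dropout(torch.relu(self.pre_classifier(pooled)))
        return self.classifier(pooled)  # logits, (batch_size, num_labels)


class FlattenHead(nn.Module):
    """Hypothetical head: flatten all positions into one vector, then one FC layer."""

    def __init__(self, seq_len=1024, hidden_size=768, num_labels=3):
        super().__init__()
        self.classifier = nn.Linear(seq_len * hidden_size, num_labels)

    def forward(self, hidden_states):
        flat = hidden_states.flatten(start_dim=1)  # (batch_size, seq_len * hidden_size)
        return self.classifier(flat)  # logits, (batch_size, num_labels)


# Random stand-in for model(ids, attention_mask=mask, return_dict=False)[0]
hidden = torch.randn(2, 1024, 768)

cls_head = ClsPoolingHead().eval()  # eval() disables dropout
flat_head = FlattenHead()

cls_probs = torch.softmax(cls_head(hidden), dim=-1)
flat_probs = torch.softmax(flat_head(hidden), dim=-1)
```

Both produce a (batch_size, num_labels) tensor whose rows sum to 1. The flatten head ties the classifier to a fixed sequence length and needs seq_len * hidden_size * num_labels weights, while the first-token head does not depend on the sequence length. For training I would presumably pass the raw logits to nn.CrossEntropyLoss rather than applying softmax first.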
Thank you in advance