BERT's output embeddings are 768-dimensional. Is it possible to reduce them to a lower, custom number? And if not, are there any workarounds?
BERT was pre-trained with 768-dimensional hidden states, so if you use a pre-trained model, the final layer will have that dimensionality. However, you can always take the output hidden states (not the logits) and pass them through an additional linear layer that maps the 768 dimensions down to a custom dimension.
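A minimal sketch of that idea, using a plain PyTorch linear layer as the projection head. The `ProjectionHead` class and the target dimension of 128 are hypothetical choices for illustration; a random tensor stands in for an actual BERT output of shape `(batch, seq_len, 768)`:

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Maps BERT's 768-dim hidden states to a smaller custom dimension."""
    def __init__(self, in_dim: int = 768, out_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden_states)

# Stand-in for BERT's last hidden states: (batch=2, seq_len=16, 768).
bert_output = torch.randn(2, 16, 768)
head = ProjectionHead()
reduced = head(bert_output)
print(reduced.shape)  # torch.Size([2, 16, 128])
```

In practice you would attach this head on top of the pre-trained encoder and train its weights on whatever downstream objective you have.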
While that is true, I am not sure how to frame it as a machine learning problem. For instance, what would my labels be for computing the loss and thus training the new layer's weights?
If my output were also 768-dimensional, I could have used any task to train BERT. However, I am a little uncertain how to approach this problem now.
You can keep the MLM objective and "further pre-train" the model with the new projection attached, though it's unclear how much of the "knowledge" in the original 768-dimensional representation the smaller output will retain. Alternatively, you can apply a dimensionality reduction technique (e.g. PCA) to the embeddings before resorting to changes in the BERT architecture itself, which is probably no trivial task.
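A quick sketch of the dimensionality-reduction route, using scikit-learn's PCA. The embedding matrix here is random data standing in for real BERT sentence embeddings, and the target of 128 components is an arbitrary choice:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for 1000 sentence embeddings of dimension 768;
# real embeddings would come from the pre-trained model.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1000, 768))

# Fit PCA once, then reuse the fitted object so that new embeddings
# are transformed with the same projection.
pca = PCA(n_components=128)
reduced = pca.fit_transform(embeddings)
print(reduced.shape)  # (1000, 128)
```

The upside is that no retraining of BERT is needed; the downside is that the projection is fixed and not optimized for any downstream task.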