How to Reshape Multi-Head Self-Attention Output into a Shape That Can Be Fed to a Convolution Layer

Hello, I'm running into an error with the following setup:

The output of MHSA (multi-head self-attention) is as follows:

torch.Size([20, 197, 768])
  • 20 for batch size
  • 197 for sequence length (originally 196; it became 197 after adding the class token)
  • 768 for embedding dimension
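
For reference, the shapes involved can be reproduced with a dummy tensor (a minimal sketch; the variable names are mine, not from the model):

```python
import torch

batch_size, seq_len, embed_dim = 20, 197, 768
x = torch.randn(batch_size, seq_len, embed_dim)  # stand-in for the MHSA output
print(x.shape)  # torch.Size([20, 197, 768])
```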

I want to reshape it to fit the format below in order to feed it to a convolutional layer:

torch.Size([batch_size, channels, height, width])
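
For context, this is the layout nn.Conv2d consumes; a minimal sketch (the channel counts and kernel size here are placeholders, not from my actual model):

```python
import torch.nn as nn

# Conv2d expects input of shape [batch_size, in_channels, height, width]
conv = nn.Conv2d(in_channels=768, out_channels=256, kernel_size=3, padding=1)
```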

I’ve attempted to achieve this by adding a new dimension using the following approach:

x = x.unsqueeze(1).transpose(1, 3)  # [20, 197, 768] -> [20, 1, 197, 768] -> [20, 768, 197, 1]

This successfully allows feeding to the convolutional layer (see the sketch below). However, I'm unsure whether this approach is correct, so please correct me if it's not.
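
Here is that approach written out step by step, assuming x is the MHSA output from above and a 1x1 convolution just for demonstration:

```python
import torch
import torch.nn as nn

x = torch.randn(20, 197, 768)  # MHSA output
x = x.unsqueeze(1)             # [20, 1, 197, 768]
x = x.transpose(1, 3)          # [20, 768, 197, 1]

conv = nn.Conv2d(in_channels=768, out_channels=768, kernel_size=1)
out = conv(x)
print(out.shape)               # torch.Size([20, 768, 197, 1])
```

It runs, but it treats the sequence as a 197x1 "image", which is part of why I'm unsure it's meaningful.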

Currently, I’m trying a different approach:

import math

new_size = int(math.sqrt(sequence_length))
x = x.transpose(1, 2).view(batch_size, embed_dim, new_size, new_size)

This resulted in an error stating that the shape is invalid for an input of size (some_number). My analysis: the sequence length (197) is not a perfect square, so math.sqrt(197) ≈ 14.04, which int() truncates to 14. Since view requires the element counts to match, and batch_size * 768 * 14 * 14 does not equal batch_size * 197 * 768, the call fails.
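
The element counts confirm the mismatch (a quick arithmetic check using my actual sizes):

```python
import math

batch_size, seq_len, embed_dim = 20, 197, 768
new_size = int(math.sqrt(seq_len))                   # math.sqrt(197) ≈ 14.04 -> 14
print(new_size)                                      # 14
print(batch_size * embed_dim * new_size * new_size)  # 3010560 <- what view() wants
print(batch_size * seq_len * embed_dim)              # 3025920 <- what the tensor has
```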

Is my analysis correct? How can I resolve this issue? And is there a better approach?