Can I use ConvBertModel as a decoder?
I set is_decoder = True, so I assume it should apply an autoregressive mask.
It converged very quickly on the training dataset.
But when I generate a sequence by feeding tokens in one at a time, it gives wrong results. I suspect that the autoregressive mask is either not applied or is bypassed by the convolution, so that information from subsequent tokens leaks regardless of the mask.
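
To make the suspicion concrete, here is a toy sketch in pure Python (not ConvBertModel itself) of the kind of check I have in mind: perturb one input position and see whether any earlier output position changes. The helpers conv1d and first_changed_position, the kernel values, and the padding choices are all made up for illustration. I am assuming, without having confirmed it in the library source, that a convolution padded symmetrically around each token behaves like the "centered" case below.

```python
def conv1d(x, kernel, left_pad, right_pad):
    """Plain 1D cross-correlation over a zero-padded list of floats."""
    padded = [0.0] * left_pad + list(x) + [0.0] * right_pad
    k = len(kernel)
    return [
        sum(kernel[j] * padded[i + j] for j in range(k))
        for i in range(len(padded) - k + 1)
    ]


def first_changed_position(f, x, t):
    """Perturb x[t] and return the earliest output index that changes."""
    baseline = f(x)
    perturbed_input = list(x)
    perturbed_input[t] += 1.0
    perturbed = f(perturbed_input)
    for i, (a, b) in enumerate(zip(baseline, perturbed)):
        if abs(a - b) > 1e-9:
            return i
    return None


kernel = [0.2, 0.5, 0.3]  # hypothetical weights, chosen only for the demo
k = len(kernel)


def centered(x):
    # Symmetric padding: output i sees x[i-1], x[i], x[i+1].
    return conv1d(x, kernel, k // 2, k // 2)


def causal(x):
    # Left-only padding: output i sees x[i-2], x[i-1], x[i].
    return conv1d(x, kernel, k - 1, 0)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
leak_centered = first_changed_position(centered, x, 3)
leak_causal = first_changed_position(causal, x, 3)

print("centered conv: earliest output affected by x[3] is", leak_centered)
print("causal conv:   earliest output affected by x[3] is", leak_causal)
```

With the centered convolution, changing position 3 already changes output 2, which is exactly the kind of leak that would let training converge quickly while step-by-step generation fails. The same perturbation test could be run against the real model's hidden states to confirm or rule this out.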