I have built a model from scratch, inspired by the Transformer model and related code (such as ViT), with the goal of recognizing CAPTCHAs. However, during training, I've encountered an issue with the Transformer model. After several batch iterations, I consistently observe that the token with the highest probability in the output probability matrix is <EOS>, and this problem persists even after prolonged training.
Here is an overview of my approach: I initially followed the ViT approach, where I divide each input image into many small patches. Each patch is then linearly projected to a fixed dimension, emb_d. For the decoder, I embed the CAPTCHA characters into the same emb_d dimension (note: the vocabulary includes digits and letters [0-9a-zA-Z]). This way, I construct an input sequence for the encoder.
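To make the patching step concrete, here is a minimal sketch in plain Python of cutting an image into non-overlapping patches and flattening each one. It is not the code from my repository: the function name `split_into_patches`, the 4x8 toy image, and the patch size of 4 are all made up for illustration, and a real model would do this with tensor operations before the linear projection to emb_d.

```python
def split_into_patches(image, patch_size):
    """Split a 2D image (a list of pixel rows) into flattened, non-overlapping patches.

    Patches are returned in row-major order (left to right, then top to bottom),
    which is the order in which they would form the encoder input sequence.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block into a single vector.
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Hypothetical 4x8 grayscale image whose pixel value encodes its position.
image = [[row * 8 + col for col in range(8)] for row in range(4)]
patches = split_into_patches(image, patch_size=4)

print(len(patches))      # number of patches = sequence length seen by the encoder
print(len(patches[0]))   # length of each flattened patch before projection to emb_d
```

Each flattened patch would then be multiplied by a learned weight matrix of shape [patch_size * patch_size, emb_d] to produce one token of the encoder input.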
For the encoder, I use the image patches as input and pass them through multiple encoder blocks, each consisting of multi-head self-attention layers, layer normalization, residual connections, and linear layers. Finally, the encoder’s output matches the input’s shape, i.e., [batch len_batch emb_d], and this output serves as both the key and value matrices for the decoder.
For the decoder, I use the target sequence (with a shape of [batch len_batch emb_d] and the last token removed) as input and set the target sequence (with the first token removed) as the actual target. I then compute the cross-entropy loss between the output and the target.
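The shift between decoder input and target can be sketched as follows, at the level of token indices (before embedding). This is an illustration under assumptions rather than my exact code: the special tokens `<SOS>` and `<PAD>`, their indices, the helper names, and the fixed padded length are hypothetical; only `<EOS>` and the [0-9a-zA-Z] characters come from the description above, and placing `<EOS>` at index 1 is an assumption as well.

```python
import string

# Assumed special tokens. Only <EOS> is mentioned in the post; <SOS>, <PAD>,
# and all three indices are illustrative guesses.
SPECIALS = ["<SOS>", "<EOS>", "<PAD>"]
VOCAB = SPECIALS + list(string.digits + string.ascii_lowercase + string.ascii_uppercase)
TOKEN_TO_ID = {token: index for index, token in enumerate(VOCAB)}


def encode(text, max_len):
    """Map a CAPTCHA string to <SOS> + characters + <EOS>, padded to max_len."""
    ids = [TOKEN_TO_ID["<SOS>"]] + [TOKEN_TO_ID[ch] for ch in text] + [TOKEN_TO_ID["<EOS>"]]
    if len(ids) > max_len:
        raise ValueError("text is too long for max_len")
    return ids + [TOKEN_TO_ID["<PAD>"]] * (max_len - len(ids))


def shift_for_teacher_forcing(ids):
    """Decoder input drops the last token; the target drops the first token."""
    return ids[:-1], ids[1:]


def causal_mask(length):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(length)] for i in range(length)]


ids = encode("a7Z", max_len=6)
decoder_input, target = shift_for_teacher_forcing(ids)
mask = causal_mask(len(decoder_input))

print(ids)            # [0, 13, 10, 64, 1, 2]
print(decoder_input)  # [0, 13, 10, 64, 1]
print(target)         # [13, 10, 64, 1, 2]
```

If sequences are padded as in this sketch, the padded positions in the target would normally be excluded from the loss (for example through the `ignore_index` argument of PyTorch's `CrossEntropyLoss`), and the decoder self-attention would use a causal mask like the one above so that a position cannot see later tokens.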
The issues I’ve identified are as follows: In the screenshots, it’s evident that after taking the argmax of the output probability matrix, it should yield the index of the predicted label (out), which ideally should match the target index (tgt). However, I’ve noticed that the output for ‘out’ consistently corresponds to index 1, indicating “<EOS>”.
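The check I am describing can be sketched in plain Python as below. The probability rows and targets are made-up numbers that imitate the symptom, not values from my actual run, and the helper names are hypothetical; `eos_id=1` follows the index observed above.

```python
from collections import Counter


def argmax(row):
    """Index of the largest value in one row of probabilities."""
    return max(range(len(row)), key=row.__getitem__)


def summarize_predictions(probs, targets, eos_id=1):
    """Compare argmax predictions with targets and count how often <EOS> is predicted."""
    preds = [argmax(row) for row in probs]
    accuracy = sum(p == t for p, t in zip(preds, targets)) / len(targets)
    eos_fraction = preds.count(eos_id) / len(preds)
    return preds, accuracy, eos_fraction, Counter(preds)


# Made-up output for four decoder positions over a toy four-token vocabulary.
# Every row peaks at index 1, imitating the collapse to a single token.
probs = [
    [0.10, 0.70, 0.10, 0.10],
    [0.20, 0.50, 0.20, 0.10],
    [0.10, 0.60, 0.10, 0.20],
    [0.05, 0.80, 0.05, 0.10],
]
targets = [3, 2, 3, 1]

preds, accuracy, eos_fraction, histogram = summarize_predictions(probs, targets)
print(preds)         # [1, 1, 1, 1]
print(accuracy)      # 0.25
print(eos_fraction)  # 1.0
```

Comparing the histogram of predicted indices with the histogram of target indices over a whole batch shows whether the model has collapsed to a single token or is merely biased toward the most frequent one.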
You can find the code, in case it helps to spot errors in the structure, at the following location:
I have roughly verified the network structure and found no errors, but I remain uncertain. I hope someone can help me analyze this issue, and I would be extremely grateful for any assistance in resolving it.