Hugging Face Forums
In Donut, where is the Swin output fused with the text: (1) at the start of the BART encoder, (2) in the cross-attention (K, V from Swin, Q from the attention output) of the second attention block of the BART encoder, or (3) directly in the BART decoder?
🤗Transformers
shubham05
August 2, 2023, 8:28am
Is the architecture the same as the following?
[Image: WhatsApp Image 2023-08-02 at 00.00.57]
Is it trained and tested in the same manner as the following?
[Image: Screenshot 2023-08-02 003843]
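
To make option 2 from the title concrete, here is a minimal sketch of what I mean by cross-attention, where the queries come from the text side and the keys and values come from the image side. This is only an illustration in plain Python: the function names and the toy vectors are made up, there are no learned projections or multiple heads, and it is not taken from the Donut code.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    # Scaled dot-product attention: each query attends over all keys
    # and returns a weighted average of the corresponding values.
    d = len(keys[0])
    dim_out = len(values[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_out)])
    return out

# Hypothetical toy data, not real model outputs.
image_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # stand-in for Swin output: 3 patches, dim 2
text_states = [[0.5, 0.5], [1.0, 0.0]]                 # stand-in for text token states: 2 tokens

# Q from the text side, K and V from the image side.
fused = cross_attention(text_states, image_features, image_features)
```

In this sketch `fused` has one vector per text token, each a mixture of the image patch vectors, which is the kind of fusion I am asking about.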