Regarding the data fed into the input of transformer_xl or Transformer models

If I follow the code in the PyTorch version, the following data is fed into the model's input.
bsz = 10
target_len = 3

 1 11 21
 2 12 22
 3 13 23
 4 14 24
 5 15 25
 6 16 26
 7 17 27
 8 18 28
 9 19 29
10 20 30
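
If it helps to see this in code, here is a minimal sketch of one way to build that layout. The token stream 1–30 and the exact torch calls are my own assumptions for illustration, not necessarily the code in the repository:

```python
import torch

bsz = 10
target_len = 3

# Hypothetical token stream 1..30 standing in for one sentence.
stream = torch.arange(1, bsz * target_len + 1)

# One way to reproduce the (bsz, target_len) layout shown above:
# cut the stream into target_len chunks of length bsz and stack
# them as columns.
data = stream.view(target_len, bsz).t().contiguous()
print(data)
# tensor([[ 1, 11, 21],
#         [ 2, 12, 22],
#         ...
#         [10, 20, 30]])
```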

And all of this data is one sentence.

That’s where the question arises.

If data of the shape above enters through the model's input, will relationships be learned between the elements of the batch?

Let me explain in more detail to avoid any misunderstanding.

When the data enters the model, it is transposed, and the arrangement changes as shown below.
 1  2  3  4  5  6  7  8  9 10
11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30
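
In code, this would be just another transpose of the sketch above (again, only an illustration):

```python
# Continuing the sketch above: the data is transposed once more on
# the way into the model, giving the (3, 10) arrangement shown here.
model_input = data.t()
print(model_input)
# tensor([[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10],
#         [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
#         [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]])
```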

In this state, the batch size is 3, and three separate inputs go into the model.

The problem here is that those three inputs together make up one sentence.

In this case, I wonder whether the three separate inputs entering one batch can be learned in a way that keeps them connected to each other.
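
To make the question concrete, here is a toy sketch of what I mean. The embedding size, vocabulary, and attention layer are hypothetical stand-ins, not the actual transformer_xl modules:

```python
import torch
import torch.nn as nn

# The (3, 10) arrangement from above, rebuilt so this example is
# self-contained.
model_input = torch.arange(1, 31).view(3, 10)

d_model = 16                                  # hypothetical embedding size
emb = nn.Embedding(64, d_model)               # toy vocabulary of 64 ids
attn = nn.MultiheadAttention(d_model, num_heads=2, batch_first=True)

x = emb(model_input)     # (3, 10, d_model): 3 batch elements of length 10
out, _ = attn(x, x, x)   # plain self-attention: each of the 3 rows attends
                         # only within itself, never across rows
```

In plain self-attention like this, nothing connects the three rows to one another, which is why I am asking whether transformer_xl can still learn them as one connected sentence.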

If it is possible, I would like to know the reason and the underlying principle.