Chunks and batches in MLMs

Good day everyone.

I was going through the course and there is one place that I’m unable to understand: the “Fine-tuning a masked language model” chapter of the “Main NLP tasks” section. It contains the following piece of code:

def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split by chunks of max_len
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result
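To check my understanding, I ran a self-contained copy of the function on a toy input (the token ids below are made up; 101 and 102 stand in for [CLS] and [SEP]):

```python
chunk_size = 4  # small value just for the toy example

def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split by chunks of chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result

# Two "tokenized" samples get concatenated, then re-split into chunks of 4
examples = {"input_ids": [[101, 7, 8, 102], [101, 9, 10, 11, 102]]}
out = group_texts(examples)
print(out["input_ids"])  # [[101, 7, 8, 102], [101, 9, 10, 11]] - trailing 102 dropped
```

So the sample boundaries are ignored: the 9 concatenated ids become two chunks of 4, and the leftover final token is discarded.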

If I understand correctly, it concatenates all the texts and splits the result into fixed-size chunks.
Everywhere below I refer to BERT as the particular MLM.

Let’s say we called the group_texts function, and at some chunk “X” we have the following:
['trying', 'to', 'understand', '[SEP]', '[CLS]', …, 'respect']
And at chunk “X+1” we will have:
['##ful', …]
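The situation I mean can be reproduced with a toy token list (the tokens and chunk size are made up by me):

```python
chunk_size = 4

# Made-up WordPiece tokens; 'respect' + '##ful' is a single word
tokens = ["trying", "to", "understand", "[SEP]",
          "[CLS]", "this", "is", "respect",
          "##ful", "indeed", "[SEP]", "[CLS]"]

# Same chunking logic as group_texts, applied to one flat list
total = len(tokens) // chunk_size * chunk_size
chunks = [tokens[i : i + chunk_size] for i in range(0, total, chunk_size)]
print(chunks[1])  # ['[CLS]', 'this', 'is', 'respect']
print(chunks[2])  # ['##ful', 'indeed', '[SEP]', '[CLS]']
```

The word is cut right at the boundary: 'respect' ends one chunk and '##ful' starts the next.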

And then we put these chunks into batches.

Question #1:
For RNNs and LSTMs we had to use contiguous batches to retain the model’s hidden state between batches (so-called “stateful” RNNs). In a transformer, we pass a whole chunk through the model at once. However, if a sentence gets truncated at a chunk boundary, its second part ends up in the next chunk, so we pass these incomplete sentences through the transformer. With an LSTM we would carry the previous hidden state forward to retain the meaning of the truncated sentence, but transformers don’t pass any hidden state to the next step. How does the model know about the part of the sentence that was in the previous chunk?
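To make my mental model concrete, here is a toy sketch (my own, not real models) of the difference I mean — the “RNN” threads a state through contiguous chunks, while the “transformer” sees each chunk in isolation:

```python
# Toy "RNN": the state accumulates everything seen in earlier chunks
def toy_rnn_step(chunk, state):
    return state + chunk

# Toy "transformer": only the current chunk is visible
def toy_transformer(chunk):
    return chunk

chunks = [["trying", "to"], ["understand", "this"]]

state = []
for chunk in chunks:
    state = toy_rnn_step(chunk, state)
print(state)  # ['trying', 'to', 'understand', 'this'] - context carried over

print(toy_transformer(chunks[1]))  # ['understand', 'this'] - chunk 0 is invisible
```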

Question #2:
From my example above, the subwords of one word can end up in different chunks; in my case it’s 'respect' and '##ful'. When I want to do whole word masking, do I have to move 'respect' from the first chunk into the second chunk and pad the first chunk? I don’t get this part.
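For reference, my understanding of whole word masking is roughly this sketch (my own simplification, grouping '##' continuations with their head token); a chunk that begins with '##ful' would have no head token to group with:

```python
import random

def whole_word_mask(tokens, mask_prob, seed=0):
    """Mask whole words: a '##' continuation subword is always masked
    together with the token(s) it belongs to."""
    random.seed(seed)
    groups, masked = [], list(tokens)
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and groups:
            groups[-1].append(i)  # attach subword to the previous word
        else:
            groups.append([i])    # start a new word
    for group in groups:
        if random.random() < mask_prob:
            for i in group:
                masked[i] = "[MASK]"
    return masked

print(whole_word_mask(["so", "respect", "##ful"], mask_prob=1.0))
# ['[MASK]', '[MASK]', '[MASK]'] - 'respect' and '##ful' are masked together
```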

Question #3:
A chunk can contain multiple sentences, each surrounded by the [CLS] and [SEP] tokens. Is attention for a specific token only calculated with respect to the tokens between its own [CLS] and [SEP]? If so, how does the model ignore tokens before/after these [CLS]/[SEP] boundaries, i.e. tokens that do not belong to this particular sentence?
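To illustrate what I imagine: if attention really were restricted per sentence, the attention mask over a chunk would have to be block-diagonal over the [CLS]…[SEP] spans, something like this sketch of mine (1 = may attend):

```python
def per_sentence_mask(tokens):
    # Assign each token the index of its [CLS]...[SEP] sentence
    sent_id, ids = 0, []
    for tok in tokens:
        ids.append(sent_id)
        if tok == "[SEP]":
            sent_id += 1
    n = len(tokens)
    # 1 where two tokens belong to the same sentence, else 0
    return [[int(ids[i] == ids[j]) for j in range(n)] for i in range(n)]

tokens = ["[CLS]", "hi", "[SEP]", "[CLS]", "bye", "[SEP]"]
for row in per_sentence_mask(tokens):
    print(row)
# First three rows: [1, 1, 1, 0, 0, 0]
# Last three rows:  [0, 0, 0, 1, 1, 1]
```

Is something like this what actually happens inside the model, or does every token attend to every other token in the chunk?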

Any links that explain these questions would be helpful as well.

I would really appreciate your help. Thank you.