Chunks and batches in MLMs

rs15 · December 23, 2021, 1:13pm

Good day everyone.

I was going through the course and there is one part that I can’t understand. It is in the “Main NLP tasks” section, in the chapter “Fine-tuning a masked language model”, which contains the following code:

def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split into chunks of chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result

If I understand correctly, it concatenates our texts and splits the result into chunks of chunk_size tokens.
I use BERT as the example MLM everywhere below.

Let’s say we called the group_texts function, and at some chunk “X” we have the following:
['trying', 'to', 'understand', '[SEP]', '[CLS]', …, 'respect']
And at the “X+1” chunk we will have:
['##ful', …]

And then we put these chunks into batches.
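To make the scenario concrete, here is a minimal sketch that runs the same group_texts logic on a toy batch. The chunk_size of 4 and the token lists are made up for illustration, and token strings stand in for the integer ids a real tokenizer would produce.

```python
chunk_size = 4  # toy value, chosen only for illustration


def group_texts(examples):
    # Same logic as the course snippet quoted above
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    total_length = (total_length // chunk_size) * chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


# Hypothetical pre-tokenized batch: token strings stand in for integer ids
examples = {
    "input_ids": [
        ["[CLS]", "trying", "to", "understand", "[SEP]"],
        ["[CLS]", "be", "respect", "##ful", "to", "all", "[SEP]"],
    ],
    "attention_mask": [[1] * 5, [1] * 7],
}

chunks = group_texts(examples)
for chunk in chunks["input_ids"]:
    print(chunk)
```

With these toy inputs the 12 tokens split into three chunks, and the second chunk ends with 'respect' while the third starts with '##ful', which is exactly the situation described above.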

Question #1:
For RNNs and LSTMs, we had to use contiguous batches to carry the model’s hidden state from one batch to the next (aka “stateful” RNNs). A transformer processes each chunk as a whole. However, if a sentence gets truncated at a chunk boundary, its second part ends up in the next chunk, so we pass incomplete sentences through the transformer. With an LSTM, we would pass the previous hidden state along to retain the meaning of the truncated sentence, but transformers don’t pass any hidden state to the next step. How does the model know about the part of the sentence that is in the previous chunk?

Question #2:
As my example above shows, the subwords of a single word can end up in different chunks. In my case these are ‘respect’ and ‘##ful’. If I want to do whole word masking, do I have to move ‘respect’ from the first chunk into the second chunk and pad the first chunk? I don’t understand how this case is handled.

Question #3:
A chunk can contain multiple sentences, each surrounded by the [CLS] and [SEP] tokens. Is attention for a specific token in a sentence calculated only with respect to the tokens between that sentence’s [CLS] and [SEP]? If so, how does the model ignore the tokens before/after these [CLS]/[SEP] boundaries, that is, the tokens that do not belong to this particular sentence?
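One way to frame this question is to look at what the attention_mask looks like after the concatenation step. The sketch below uses made-up token ids (101 and 102 are the [CLS] and [SEP] ids in the bert-base-uncased vocabulary; the others are arbitrary) and assumes the tokenizer was called without padding, so every mask value is 1.

```python
# Hypothetical tokenizer output for two short sentences (no padding applied)
examples = {
    "input_ids": [[101, 7, 8, 102], [101, 9, 10, 11, 102]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1, 1, 1]],
}
chunk_size = 8  # toy value, chosen only for illustration

# Same concatenate-and-split idea as group_texts, shown for the first chunk
flat = {k: sum(v, []) for k, v in examples.items()}
first_chunk = {k: t[:chunk_size] for k, t in flat.items()}

print(first_chunk["input_ids"])
print(first_chunk["attention_mask"])
```

Under these assumptions the mask for the chunk is all ones, so the preprocessing itself passes no information about sentence boundaries to the attention layers beyond the [CLS] and [SEP] tokens in the input.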

Any links that explain these questions would be helpful as well.

I would really appreciate your help. Thank you.

marcelcramer · June 22, 2023, 11:05am

Hello,

thank you @rs15 for the questions. I have a similar use case. It would be great if someone with more experience could weigh in on these questions.
