[URGENT] Issues with Training RoBERTa Model for Text Prediction with Fill Mask Task

Hi. I have a problem. I am pretraining a RoBERTa MLM model from scratch on Slovak text in Python. I have trained my own BPE tokenizer and tokenized the texts with it, obtaining a dictionary of encodings with max_length=256. Sample here
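
For reference, the encodings come from a batched tokenizer call roughly along these lines (a simplified sketch, not my exact preprocessing; the text shown is only illustrative):

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('./tokenizers/pureBPE')

# illustrative Slovak input; the real corpus is of course much larger
texts = ['cez bratislavu tečie dunaj a žijú tam ryby.']

encodings = tokenizer(
    texts,
    padding='max_length',   # pad every sequence to 256 tokens
    truncation=True,
    max_length=256,
    return_tensors='pt'     # dict of tensors: input_ids, attention_mask
)
# labels (the unmasked ids) and the masked input_ids are added afterwards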

The problem arises during training of the MLM model. Within individual epochs and batches, the loss decreases, which suggests the model is training properly. However, when I use the trained model to predict the masked token, I get the same output for every input: the same tokens, in the same order, with the same probabilities, which is incorrect. Additionally, special tokens such as <pad>, <s> and </s> appear among the predicted tokens with very high probabilities.

from transformers import pipeline
from transformers import RobertaTokenizer, AutoModelForMaskedLM

model_path = 'PureBPEMLM_epoch_0'
tokenizer_path = './tokenizers/pureBPE'

tokenizer_ = RobertaTokenizer.from_pretrained(tokenizer_path)
model_ = AutoModelForMaskedLM.from_pretrained(model_path)

fill = pipeline(
    'fill-mask',
    model=model_,
    tokenizer=tokenizer_
)

fill('cez bratislavu tečie <mask> a žijú tam ryby.')

Sample here

Below is the code for training the model. I should also mention that the dataloader has a length of 13987 batches with a batch size of 64, for a total of 895168 sequences.

with open("./09032024_purebpe", 'rb') as file:
    encodings = pickle.load(file)
    
# Wrap the encodings dict (input_ids, attention_mask, labels) in a torch Dataset
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return self.encodings['input_ids'].shape[0]

    def __getitem__(self, i):
        return {key: tensor[i] for key, tensor in self.encodings.items()}
    
dataset = Dataset(encodings.data)
batch_size = 64
num_workers = 8  
dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=num_workers)


tokenizer_path = './tokenizers/pureBPE'
tokenizer = RobertaTokenizer.from_pretrained(tokenizer_path)


# Model configuration, optimizer, and training loop
config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    max_position_embeddings=258,
    hidden_size=576,
    num_attention_heads=12,
    num_hidden_layers=6,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    type_vocab_size=1
)

model = RobertaForMaskedLM(config)
print('Num parameters: ',model.num_parameters())

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

model.train()
optimizer = optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

epochs = 5


for epoch in range(epochs):
    loop = tqdm(dataloader, leave=True)
    for batch in loop:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())

            
    # Save after each epoch
    model_save_path = f'./PureBPEMLM_epoch_{epoch}'
    model.save_pretrained(model_save_path)

Could someone please advise me on how to solve this problem?

Thank you in advance for your answer.

Why are your labels almost identical to your input_ids? What are they, anyway?

That's because 15% of the tokens in input_ids are replaced by the mask token with id = 4, while the labels contain all tokens without masking. I'll show a better example here.
But I have already managed to find out something: if I do not exceed a certain number of training steps (around 7000 for me), the model can deal with the PAD token. I don't know why that is.
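
For reference, the common Hugging Face convention is slightly different from mine: the non-masked label positions are set to -100, which the loss ignores, so the loss is computed only on the masked tokens. A rough sketch of that variant (not what my current preprocessing does; mask_with_ignored_labels is just an illustrative name):

import torch

def mask_with_ignored_labels(input_ids, mask_token_id=4, mlm_prob=0.15):
    labels = input_ids.clone()
    rand = torch.rand(input_ids.shape)
    # pick ~15% of positions, skipping the special tokens with ids 0-2
    mask_arr = (rand < mlm_prob) & (input_ids > 2)
    # -100 is the default ignore_index of the cross-entropy loss
    labels[~mask_arr] = -100
    masked_input_ids = input_ids.clone()
    masked_input_ids[mask_arr] = mask_token_id
    return masked_input_ids, labels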

OK, but why mask only with the token with id = 4? Isn't it supposed to be random masking of 15% of the tokens?

Hello,

In my opinion, it is highly probable that your tokenizer is misspecified. Check that your special tokens are correctly encoded and that their ids are different from those of your regular vocabulary tokens. Also, the fact that your model is predicting the pad token might indicate a problem with the attention masks: theoretically, your attention mask should be 0 on pad tokens so that no attention is paid to them. Also, your type vocabulary size is set to 1, which from what I remember should be 2, as you have special tokens and "vanilla" ones.
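
A quick sanity check along these lines might help (just a rough sketch, adjust the paths to yours):

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('./tokenizers/pureBPE')

# special tokens and their ids should not collide with regular vocabulary entries
print(tokenizer.all_special_tokens)
print(tokenizer.all_special_ids)
print(tokenizer.mask_token, tokenizer.mask_token_id)
print(tokenizer.pad_token, tokenizer.pad_token_id)

# the attention mask should be 0 wherever the input is padded
enc = tokenizer('krátka veta', padding='max_length', max_length=16, return_tensors='pt')
print(enc['input_ids'])
print(enc['attention_mask'])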

Hope this helps

Because the mask id is 4. Here is a sample of the first 13 tokens:

<s>:0
<pad>:1
</s>:2
<unk>:3
<mask>:4
!:5
":6
#:7
$:8
%:9
&:10
':11
(:12

And yes, I mask everything except tokens 0, 1 and 2 with a probability of 15%.

def mlm(tensor):
    # random draw for every token position
    rand = torch.rand(tensor.shape)
    # select ~15% of positions, excluding ids 0-2 (<s>, <pad>, </s>)
    mask_arr = (rand < 0.15) & (tensor > 2)
    # replace the selected positions with the <mask> id (4)
    tensor[mask_arr] = 4
    return tensor
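
For comparison (and not what I use above): the standard DataCollatorForLanguageModeling masks 15% of the tokens dynamically in every batch, with the usual 80% <mask> / 10% random token / 10% unchanged split, and sets the non-masked label positions to -100. A minimal sketch, assuming my tokenizer and the unmasked input_ids:

from transformers import DataCollatorForLanguageModeling, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('./tokenizers/pureBPE')

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15
)

# the collator expects a list of feature dicts with *unmasked* input_ids
features = [{'input_ids': ids.tolist()} for ids in encodings['input_ids'][:4]]
batch = collator(features)
print(batch['input_ids'][0])   # dynamically masked
print(batch['labels'][0])      # -100 everywhere except the masked positions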

I checked everything you mentioned, and the tokenizer seems to be created correctly. All tokens should be properly encoded; I provide an example here.
Well, you're right that the type_vocab_size parameter could be set to 2, but from what I've read in the documentation, type_vocab_size allows the model to work with multiple types of input segments, which is useful for various NLP tasks. So if I'm only doing the token masking task, can it be set to 1?
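
One way I can check the convention is to look at the published roberta-base configuration, which as far as I know ships with type_vocab_size = 1 (quick sketch, requires downloading the config):

from transformers import RobertaConfig

config = RobertaConfig.from_pretrained('roberta-base')
print(config.type_vocab_size)   # the original RoBERTa uses a single token type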