I am studying the course chapter on training a masked language model, in the part where we write our training loop:
```python
from tqdm.auto import tqdm
import torch
import math

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)

        loss = outputs.loss
        # repeat the scalar batch loss once per sample, then gather across processes
        losses.append(accelerator.gather(loss.repeat(batch_size)))

    losses = torch.cat(losses)
    losses = losses[: len(eval_dataloader)]
    try:
        perplexity = math.exp(torch.mean(losses))
    except OverflowError:
        perplexity = float("inf")

    print(f">>> Epoch {epoch}: Perplexity: {perplexity}")

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )
```
I am curious why we set `losses = losses[: len(eval_dataloader)]`.
There are 1000 samples in `eval_dataset` and I set `batch_size = 8`, so there are exactly 125 batches in `eval_dataloader`. After `losses = torch.cat(losses)` there are 1000 items in `losses`, which I understand, since we have 1000 samples.
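To double-check that counting outside of the course code, here is a minimal single-process sketch (so I treat `accelerator.gather` as a no-op; the zero tensors are just stand-ins for my data):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for my setup: 1000 samples, batch_size = 8.
eval_dataset = TensorDataset(torch.zeros(1000, 1))
eval_dataloader = DataLoader(eval_dataset, batch_size=8)
print(len(eval_dataloader))  # 125 batches

losses = []
for _ in eval_dataloader:
    loss = torch.rand(())          # stand-in for the scalar outputs.loss
    losses.append(loss.repeat(8))  # one copy of the batch mean per sample
losses = torch.cat(losses)
print(losses.shape)  # torch.Size([1000]) -> 1000 items, matching what I see
```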
But then we keep only the first 125 items in `losses` via `losses = losses[: len(eval_dataloader)]`, and the perplexity is calculated from the mean of those remaining 125 losses. I printed `losses` after `losses = losses[: len(eval_dataloader)]`:
```
tensor([2.5507, 2.5507, 2.5507, 2.5507, 2.5507, 2.5507, 2.5507, 2.5507, 2.4906,
        2.4906, 2.4906, 2.4906, 2.4906, 2.4906, 2.4906, 2.4906, 2.1742, 2.1742,
        2.1742, 2.1742, 2.1742, 2.1742, 2.1742, 2.1742, 2.1273, 2.1273, 2.1273,
        2.1273, 2.1273, 2.1273, 2.1273, 2.1273, 2.5359, 2.5359, 2.5359, 2.5359,
        2.5359, 2.5359, 2.5359, 2.5359, 2.1983, 2.1983, 2.1983, 2.1983, 2.1983,
        2.1983, 2.1983, 2.1983, 2.4421, 2.4421, 2.4421, 2.4421, 2.4421, 2.4421,
        2.4421, 2.4421, 2.6299, 2.6299, 2.6299, 2.6299, 2.6299, 2.6299, 2.6299,
        2.6299, 2.6257, 2.6257, 2.6257, 2.6257, 2.6257, 2.6257, 2.6257, 2.6257,
        2.4568, 2.4568, 2.4568, 2.4568, 2.4568, 2.4568, 2.4568, 2.4568, 2.4120,
        2.4120, 2.4120, 2.4120, 2.4120, 2.4120, 2.4120, 2.4120, 2.0684, 2.0684,
        2.0684, 2.0684, 2.0684, 2.0684, 2.0684, 2.0684, 2.0655, 2.0655, 2.0655,
        2.0655, 2.0655, 2.0655, 2.0655, 2.0655, 2.2470, 2.2470, 2.2470, 2.2470,
        2.2470, 2.2470, 2.2470, 2.2470, 2.2354, 2.2354, 2.2354, 2.2354, 2.2354,
        2.2354, 2.2354, 2.2354, 1.8888, 1.8888, 1.8888, 1.8888, 1.8888])
```
We can see that every group of 8 losses is identical.
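If I understand correctly, that is because `outputs.loss` is already the mean loss over the whole batch, and `loss.repeat(batch_size)` simply copies that single scalar, e.g.:

```python
import torch

batch_mean_loss = torch.tensor(2.5507)  # outputs.loss is one scalar per batch
print(batch_mean_loss.repeat(8))
# tensor([2.5507, 2.5507, 2.5507, 2.5507, 2.5507, 2.5507, 2.5507, 2.5507])
```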
My question is: shouldn't we calculate the loss as the mean of all 1000 per-sample losses from the eval dataset? Why keep only the first `len(eval_dataloader)` losses?