# Question about loss computation when training a masked language model

I am working through the course chapter on training a masked language model.

The section on writing our own training loop contains this code:

``````
from tqdm.auto import tqdm
import torch
import math

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()  # reset gradients so they do not accumulate across steps
        progress_bar.update(1)

    # Evaluation
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():  # no gradients are needed during evaluation
            outputs = model(**batch)

        loss = outputs.loss
        # outputs.loss is one scalar per batch; repeat it batch_size times, then gather across processes
        losses.append(accelerator.gather(loss.repeat(batch_size)))

    losses = torch.cat(losses)
    losses = losses[: len(eval_dataset)]
    try:
        perplexity = math.exp(torch.mean(losses))
    except OverflowError:
        perplexity = float("inf")

    print(f">>> Epoch {epoch}: Perplexity: {perplexity}")

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )
``````

Why do we set `losses = losses[: len(eval_dataloader)]`?

There are 1000 samples in `eval_dataset` and I set `batch_size=8`, so there are exactly 125 batches in `eval_dataloader`.

After `losses = torch.cat(losses)`, there are 1000 items in `losses`. I understand this, since we have 1000 samples.

But then `losses = losses[: len(eval_dataloader)]` keeps only the first 125 items of `losses`, and the perplexity is computed from the mean of those remaining 125 losses.
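To make the counting concrete, here is a minimal sketch of what I think happens to `losses`. It is plain Python, not the course code: lists stand in for tensors, `[batch_loss] * batch_size` stands in for `loss.repeat(batch_size)`, a single process is assumed, and the loss values are made up.

```python
# Plain-Python stand-in for the evaluation loop (no torch or accelerate needed).
# The per-batch loss values below are made up for illustration only.
num_samples = 1000
batch_size = 8
num_batches = num_samples // batch_size  # 125, i.e. len(eval_dataloader)

losses = []
for step in range(num_batches):
    batch_loss = 2.0 + 0.001 * step           # hypothetical scalar loss of this batch
    losses.extend([batch_loss] * batch_size)  # mimics loss.repeat(batch_size)

print(len(losses))  # 1000 entries, 8 copies of each batch loss

kept = losses[:num_batches]  # mimics losses[: len(eval_dataloader)]
print(len(kept))             # 125

# 125 entries at 8 copies per batch: how many batches are actually covered?
full_batches, leftover = divmod(len(kept), batch_size)
print(full_batches, leftover)  # 15 5
print(len(set(kept)))          # 16 distinct batch losses out of 125
```

Under these assumptions the slice covers only the first 15 batches in full plus 5 copies of the 16th, so most of the evaluation set never enters the mean.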

I printed the losses after `losses = losses[: len(eval_dataloader)]`:

``````
tensor([2.5507, 2.5507, 2.5507, 2.5507, 2.5507, 2.5507, 2.5507, 2.5507, 2.4906,
        2.4906, 2.4906, 2.4906, 2.4906, 2.4906, 2.4906, 2.4906, 2.1742, 2.1742,
        2.1742, 2.1742, 2.1742, 2.1742, 2.1742, 2.1742, 2.1273, 2.1273, 2.1273,
        2.1273, 2.1273, 2.1273, 2.1273, 2.1273, 2.5359, 2.5359, 2.5359, 2.5359,
        2.5359, 2.5359, 2.5359, 2.5359, 2.1983, 2.1983, 2.1983, 2.1983, 2.1983,
        2.1983, 2.1983, 2.1983, 2.4421, 2.4421, 2.4421, 2.4421, 2.4421, 2.4421,
        2.4421, 2.4421, 2.6299, 2.6299, 2.6299, 2.6299, 2.6299, 2.6299, 2.6299,
        2.6299, 2.6257, 2.6257, 2.6257, 2.6257, 2.6257, 2.6257, 2.6257, 2.6257,
        2.4568, 2.4568, 2.4568, 2.4568, 2.4568, 2.4568, 2.4568, 2.4568, 2.4120,
        2.4120, 2.4120, 2.4120, 2.4120, 2.4120, 2.4120, 2.4120, 2.0684, 2.0684,
        2.0684, 2.0684, 2.0684, 2.0684, 2.0684, 2.0684, 2.0655, 2.0655, 2.0655,
        2.0655, 2.0655, 2.0655, 2.0655, 2.0655, 2.2470, 2.2470, 2.2470, 2.2470,
        2.2470, 2.2470, 2.2470, 2.2470, 2.2354, 2.2354, 2.2354, 2.2354, 2.2354,
        2.2354, 2.2354, 2.2354, 1.8888, 1.8888, 1.8888, 1.8888, 1.8888])
``````

As we can see, every group of 8 consecutive losses is identical (the last group is cut off after 5 entries).

My question is:
Shouldn’t we compute the loss as the mean over the losses from all 1000 samples in the eval dataset? Why keep only the first `len(eval_dataloader)` losses?
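To illustrate why the choice of slice matters, here is a small sketch comparing the two means. It is plain Python with made-up per-batch losses, not the course code, and it assumes every batch is full (1000 is divisible by 8), so the mean over all 1000 repeated entries equals the mean over the 125 batch losses.

```python
import math

batch_size = 8
# Hypothetical per-batch losses for 125 batches (made-up values).
batch_losses = [2.0 + 0.001 * step for step in range(125)]

# Each batch loss repeated batch_size times, as loss.repeat(batch_size) would do.
repeated = [loss for loss in batch_losses for _ in range(batch_size)]

mean_all = sum(repeated) / len(repeated)             # all 1000 entries
mean_per_batch = sum(batch_losses) / len(batch_losses)
mean_first_125 = sum(repeated[:125]) / 125           # slice by len(eval_dataloader)

# With full batches, repeating does not change the mean.
print(math.isclose(mean_all, mean_per_batch))

# The truncated slice only reflects the first 16 batches, so its mean
# (and the perplexity derived from it) generally differs.
print(math.exp(mean_all), math.exp(mean_first_125))
```

With these made-up numbers the truncated mean comes out lower only because the early batches happen to have lower losses; with real data it could differ in either direction.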