I am studying the course chapter on training a masked language model, in the part where we write our training loop:
```python
from tqdm.auto import tqdm
import torch
import math

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)

        loss = outputs.loss
        # repeat the scalar batch loss once per sample, then gather across processes
        losses.append(accelerator.gather(loss.repeat(batch_size)))

    losses = torch.cat(losses)
    losses = losses[: len(eval_dataloader)]
    try:
        perplexity = math.exp(torch.mean(losses))
    except OverflowError:
        perplexity = float("inf")

    print(f">>> Epoch {epoch}: Perplexity: {perplexity}")

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )
```
I am curious why we set `losses = losses[: len(eval_dataloader)]`.
There are 1000 samples in `eval_dataset` and I set `batch_size = 8`, so there are exactly 125 batches in `eval_dataloader`. After `losses = torch.cat(losses)` there are 1000 items in `losses`, which I understand, since we have 1000 samples.
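To double-check that counting outside of the course code, here is a minimal single-process sketch (so I treat `accelerator.gather` as a no-op; the zero tensors are just stand-ins for my data):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for my setup: 1000 samples, batch_size = 8.
eval_dataset = TensorDataset(torch.zeros(1000, 1))
eval_dataloader = DataLoader(eval_dataset, batch_size=8)
print(len(eval_dataloader))  # 125 batches

losses = []
for _ in eval_dataloader:
    loss = torch.rand(())          # stand-in for the scalar outputs.loss
    losses.append(loss.repeat(8))  # one copy of the batch mean per sample
losses = torch.cat(losses)
print(losses.shape)  # torch.Size([1000]) -> 1000 items, matching what I see
```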
But then we keep only the first 125 items in `losses` via `losses = losses[: len(eval_dataloader)]`, and the perplexity is calculated from the mean of those remaining 125 losses. I printed `losses` after `losses = losses[: len(eval_dataloader)]`:
```
tensor([2.5507, 2.5507, 2.5507, 2.5507, 2.5507, 2.5507, 2.5507, 2.5507, 2.4906,
        2.4906, 2.4906, 2.4906, 2.4906, 2.4906, 2.4906, 2.4906, 2.1742, 2.1742,
        2.1742, 2.1742, 2.1742, 2.1742, 2.1742, 2.1742, 2.1273, 2.1273, 2.1273,
        2.1273, 2.1273, 2.1273, 2.1273, 2.1273, 2.5359, 2.5359, 2.5359, 2.5359,
        2.5359, 2.5359, 2.5359, 2.5359, 2.1983, 2.1983, 2.1983, 2.1983, 2.1983,
        2.1983, 2.1983, 2.1983, 2.4421, 2.4421, 2.4421, 2.4421, 2.4421, 2.4421,
        2.4421, 2.4421, 2.6299, 2.6299, 2.6299, 2.6299, 2.6299, 2.6299, 2.6299,
        2.6299, 2.6257, 2.6257, 2.6257, 2.6257, 2.6257, 2.6257, 2.6257, 2.6257,
        2.4568, 2.4568, 2.4568, 2.4568, 2.4568, 2.4568, 2.4568, 2.4568, 2.4120,
        2.4120, 2.4120, 2.4120, 2.4120, 2.4120, 2.4120, 2.4120, 2.0684, 2.0684,
        2.0684, 2.0684, 2.0684, 2.0684, 2.0684, 2.0684, 2.0655, 2.0655, 2.0655,
        2.0655, 2.0655, 2.0655, 2.0655, 2.0655, 2.2470, 2.2470, 2.2470, 2.2470,
        2.2470, 2.2470, 2.2470, 2.2470, 2.2354, 2.2354, 2.2354, 2.2354, 2.2354,
        2.2354, 2.2354, 2.2354, 1.8888, 1.8888, 1.8888, 1.8888, 1.8888])
```
We can see that every group of 8 losses is identical.
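If I understand correctly, that is because `outputs.loss` is already the mean loss over the whole batch, and `loss.repeat(batch_size)` simply copies that single scalar, e.g.:

```python
import torch

batch_mean_loss = torch.tensor(2.5507)  # outputs.loss is one scalar per batch
print(batch_mean_loss.repeat(8))
# tensor([2.5507, 2.5507, 2.5507, 2.5507, 2.5507, 2.5507, 2.5507, 2.5507])
```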
My question is: shouldn't we calculate the loss as the mean of all 1000 per-sample losses from the eval dataset? Why keep only the first `len(eval_dataloader)` losses?