The code below is taken from the tutorial
from datasets import load_metric
import torch

metric = load_metric("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()
Inside the for batch in eval_dataloader: loop, how can I know which indices from the dataset this batch includes?
The DataLoader is created earlier with:

eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)
Note that it is created without the shuffling flag, so it is possible to count manually using the batch size (see the first sketch below), but how can I do it with shuffling? Is it possible to make the index a field of the batch when creating the dataset and dataloader? The second sketch below is roughly what I have in mind.
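For the unshuffled case, this is the manual counting I mean (a rough sketch; it assumes the batch size of 8 from above and that the order of the dataset is preserved):

for batch_idx, batch in enumerate(eval_dataloader):
    # With shuffle off, batches come in dataset order, so the rows covered
    # by this batch can be reconstructed from a running count.
    start = batch_idx * 8  # same batch_size as in the DataLoader above
    indices = list(range(start, start + len(batch["labels"])))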
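And for the second question, this is roughly what I mean by making the index a field of the batch (untested sketch; the sample_idx name and the collate wrapper are my own, and I'm assuming data_collator is the DataCollatorWithPadding from the tutorial):

import torch
from torch.utils.data import DataLoader

# Attach each example's position in the dataset as an extra column.
val_ds = tokenized_datasets["validation"]
val_ds = val_ds.add_column("sample_idx", list(range(len(val_ds))))

def collate_with_idx(features):
    # Pull the index out before the usual collator pads the rest of the batch.
    idxs = torch.as_tensor([int(f.pop("sample_idx")) for f in features])
    batch = data_collator(features)
    batch["sample_idx"] = idxs  # would need to be popped again before model(**batch)
    return batch

eval_dataloader = DataLoader(
    val_ds, batch_size=8, shuffle=True, collate_fn=collate_with_idx
)

Would something like this work, or is there a built-in way to get the original indices per batch?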