ozoloev
November 29, 2024, 5:14pm
I just have a torch DataLoader and do multi-GPU inference with Accelerate. Each batch in the loader has three fields: input_ids, attention_mask, and user_ids. After get_inference_dataset these fields are all still in the loader, but after .prepare I no longer have the user_ids field, only the ids and the mask.
inference_loader = get_inference_dataset(config, tokenizer)
print(next(iter(inference_loader))["user_ids"])
inference_loader = accelerator.prepare_data_loader(inference_loader)
print(next(iter(inference_loader))["user_ids"])
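One possible workaround, assuming the problem is that the prepared loader only carries tensor fields when it shards batches across processes (user_ids here being a plain Python list), is to return user_ids as a tensor from the collate function so it is treated the same way as input_ids and attention_mask. A minimal sketch; collate_fn and the integer user IDs are assumptions, not part of the original code:

import torch

# Hypothetical collate function: batch user_ids as a tensor (assuming they
# are integers) so the field is sharded by accelerator.prepare_data_loader
# just like the other tensor fields, instead of being dropped.
def collate_fn(examples):
    return {
        "input_ids": torch.stack([ex["input_ids"] for ex in examples]),
        "attention_mask": torch.stack([ex["attention_mask"] for ex in examples]),
        "user_ids": torch.tensor([ex["user_ids"] for ex in examples], dtype=torch.long),
    }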
This seems to happen because of drop_last=True. I don't know if this is the case.
From a related GitHub issue (opened 9 Jan 2024, closed 17 Apr 2024):
In the [tutorial](https://huggingface.co/docs/accelerate/quicktour), it is mentioned that `Some data at the end of the dataset may be duplicated so the batch can be divided equally among all workers.` So if my train dataset size is not divisible by the number of GPUs, the dataloader after prepare() will include duplicated data during training? Won't this affect model performance (loss etc.), since it adds extra data to the train dataset? If it causes a large difference, is there any way to exclude these duplicated samples when calculating the loss?
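For the evaluation side of this, Accelerate's gather_for_metrics is the documented way to drop the duplicated tail samples before computing metrics. A minimal sketch, assuming model and eval_loader come from accelerator.prepare(...) and that each batch carries a labels field (those names are assumptions here):

import torch

all_predictions, all_labels = [], []
model.eval()
for batch in eval_loader:
    with torch.no_grad():
        logits = model(batch["input_ids"], attention_mask=batch["attention_mask"]).logits
    preds = logits.argmax(dim=-1)
    # gather_for_metrics gathers across processes AND drops the samples that
    # were duplicated to make the dataset divisible by the number of GPUs
    preds, labels = accelerator.gather_for_metrics((preds, batch["labels"]))
    all_predictions.append(preds)
    all_labels.append(labels)

predictions = torch.cat(all_predictions)
labels = torch.cat(all_labels)

For the training loss itself there is no such built-in deduplication, but the duplicated samples are at most one extra batch per epoch, so their effect on the loss is usually negligible.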