For example, I have a dataset with 500 labels. Each label has 1-20G raw image data. For the training, I want each batch can randomly sample 256 images of the same label, and randomly select batch_size
of labels of data.
Since the whole dataset is huge I created a Huggingface Dataset file for each label. etc.,
./dataset/label1/data-00000-of-00001.arrow, ./dataset/label2/data-00000-of-00001.arrow ...
I tried two methods to load the dataset to see if I can speed up:
gid = random.sample(self.labels, 1)[0]
files = get_arrow_files(os.path.join(self.dataset_path, gid))
data_files = {"train": files}
# ds = load_dataset("arrow", split='train', data_files=data_files, streaming=True)
ds = Dataset.from_file(files[0])
ds = ds.shuffle()
datas = []
for idx, batch in enumerate(ds):
datas.append(batch)
if (idx+1) == self.num_sample:
break
return np.asarray(datas)
For the first one load_dataset("arrow", split='train', data_files=data_files, streaming=True)
, it can load faster than ds = Dataset.from_file(files[0])
for one batch, but it can’t use multi-workers to simultaneously load all batches since it returns a warning:
WARNING:datasets.iterable_dataset:Too many dataloader workers: 2 (max is dataset.n_shards=1). Stopping 1 dataloader workers.
For the second one ds = Dataset.from_file(files[0])
, it can simultaneously load multiple arrow files, but it will be pretty slow if the arrow files are quite huge like 20G and the queue will be stacked until the first worker finish loading.
I’m wondering if there’s any solution that I can speed up the data loading process without changing the data sampling strategy. Thanks!