Error training with IterableDatasets

Hello, I’m stuck on an error trying to train Wav2Vec2 with an IterableDatasetDict. I pasted some key code below, but I can send the whole .py file if that would be better.

Here is the error:

Num examples = 16
Num Epochs = 9223372036854775807
Instantaneous batch size per device = 16
Total train batch size (w. parallel, distributed & accumulation) = 16
Gradient Accumulation steps = 1
Total optimization steps = 1
0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\jfolstein\Documents\Projects\HOAX\Wav2Vec2v2\venv\lib\site-packages\torch\utils\data\dataloader.py", line 652, in __next__
    data = self._next_data()
  File "C:\Users\jfolstein\Documents\Projects\HOAX\Wav2Vec2v2\venv\lib\site-packages\torch\utils\data\dataloader.py", line 692, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "C:\Users\jfolstein\Documents\Projects\HOAX\Wav2Vec2v2\venv\lib\site-packages\torch\utils\data\_utils\fetch.py", line 32, in fetch
    data.append(next(self.dataset_iter))
  File "C:\Users\jfolstein\Documents\Projects\HOAX\Wav2Vec2v2\venv\lib\site-packages\datasets\formatting\dataset_wrappers\torch_iterable_dataset.py", line 28, in __iter__
    yield from IterableDataset.__iter__(self)
  File "C:\Users\jfolstein\Documents\Projects\HOAX\Wav2Vec2v2\venv\lib\site-packages\datasets\iterable_dataset.py", line 599, in __iter__
    for key, example in self._iter():
  File "C:\Users\jfolstein\Documents\Projects\HOAX\Wav2Vec2v2\venv\lib\site-packages\datasets\iterable_dataset.py", line 579, in _iter
    yield from ex_iterable
  File "C:\Users\jfolstein\Documents\Projects\HOAX\Wav2Vec2v2\venv\lib\site-packages\datasets\iterable_dataset.py", line 280, in __iter__
    for key, example in iterator:
  File "C:\Users\jfolstein\Documents\Projects\HOAX\Wav2Vec2v2\venv\lib\site-packages\datasets\iterable_dataset.py", line 457, in __iter__
    for _ in islice(ex_iterator, self.n):
ValueError: Stop argument for islice() must be None or an integer: 0 <= x <= sys.maxsize.

When I follow the debugger into iterable_dataset.py and stop at the line

for _ in islice(ex_iterator, self.n)

self.n turns out to be a float. I haven’t yet been able to figure out what self.n is exactly or where it gets set.
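For reference, islice() rejects a float stop argument even when its value is whole. This minimal standalone sketch (just a generic iterator, not my dataset) reproduces the same ValueError:

from itertools import islice

# A float stop argument triggers the same ValueError seen in the traceback above
list(islice(range(10), 2.5))
# ValueError: Stop argument for islice() must be None or an integer: 0 <= x <= sys.maxsize.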

Here is how I am creating the iterable datasets:

trnsets = []
ntrnfiles = 0
for set in trnsets2use:
    myset = loaddsetshards(set['path'])
    trnsets.append(myset['train'])
    ntrnfiles += set['nfiles']

print(f"total training files: {str(ntrnfiles)}")
dset_training = hfds.interleave_datasets(trnsets, probabilities=[.1, .9])
dset_training = dset_training.shuffle(seed=1)
dset_trn = dset_training.skip(ntrnfiles/4)
dset_val = dset_training.take(ntrnfiles/4)

dset_tst = loaddsetshards(testsets2use[0]['path'])

dset = hfds.IterableDatasetDict({'train': dset_trn, 'test': dset_tst, 'validation': dset_val})

loaddsetshards loads a bunch of .json shards in a directory:

def loaddsetshards(dsetdir):
    _, _, filepaths = get_filepaths(dsetdir)
    dslist = []
    print(f"loading all shards in {dsetdir}")
    dsetout = datasets.load_dataset(dsetdir, streaming=True)
    return dsetout
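For what it’s worth, I think an equivalent streaming load could also point the json builder at the shard files directly. This is just a sketch with a placeholder glob, not what my script actually does:

import datasets

# Assumed layout: all shards are *.json files in one directory (placeholder path)
dsetout = datasets.load_dataset(
    "json",
    data_files={"train": "path/to/dsetdir/*.json"},
    streaming=True,
)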

Here are the training args:

numtrnepochs = 4
batchperdev = 16
stepsperep = numtrnepochs/batchperdev
maxsteps = int(numtrnepochs*stepsperep)

training_args = TrainingArguments(
    output_dir=outputpath,
    logging_dir=logpath,
    per_device_train_batch_size=batchperdev,
    evaluation_strategy="steps",
    num_train_epochs=numtrnepochs,
    fp16=True,
    gradient_checkpointing=True,
    save_steps=300,
    eval_steps=300,
    logging_steps=300,
    learning_rate=1e-3,  # 1e-4,
    weight_decay=0.005,
    warmup_steps=1000,
    save_total_limit=2,
    remove_unused_columns=True,
    max_steps=maxsteps
)
print('making trainer')
trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=dset["train"],
    eval_dataset=dset["validation"],
    tokenizer=processor.feature_extractor,
)

Ok, I fixed this by changing

dset_trn = dset_training.skip(ntrnfiles/4)
dset_val = dset_training.take(ntrnfiles/4)

to

dset_trn = dset_training.skip(int(ntrnfiles/4))
dset_val = dset_training.take(int(ntrnfiles/4))

Dividing by 4 turned an integer counter into a float somewhere. It might be worth adding an informative error message there, but I seem to be the only one dumb enough to make that mistake so far.
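For anyone who hits the same thing: skip() and take() expect integer counts, and / in Python 3 always returns a float. Floor division would have avoided the cast entirely; here is a small sketch of the variant I could have used, with the same names as above:

# ntrnfiles is an int, so // keeps the split size an int and skip()/take() are happy
n_val = ntrnfiles // 4
dset_trn = dset_training.skip(n_val)
dset_val = dset_training.take(n_val)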