I’m using nlpaug to augment a split of the sst2 dataset. As instructed in the documentation, I’m using map with batched=True for this purpose. The function I pass to map takes one instance (batch_size=1) and generates several instances. The important thing here is that this function is not pure: the sentences it generates, and the number of instances it returns, are completely random. Every time it runs I get a warning saying there is a problem with caching and the fingerprint, which I assumed is because of the random nature of my function.
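Roughly, my setup looks like this (the augmenter settings, the random-count logic, and the idx offset below are simplified placeholders for what I actually do, not the exact code):

import random

import nlpaug.augmenter.word as naw
from datasets import load_dataset

aug = naw.SynonymAug(aug_src="wordnet")

def augment_batch(batch):
    # batch_size=1, so each batch holds exactly one example
    sentence = batch["sentence"][0]
    label = batch["label"][0]
    idx = batch["idx"][0]

    # both the number of copies and the copies themselves are random
    n = random.randint(1, 3)
    new_sentences = aug.augment(sentence, n=n)
    if isinstance(new_sentences, str):  # some nlpaug versions return a plain str
        new_sentences = [new_sentences]

    # keep the original row and append the augmented rows; returning all three
    # columns with the new (larger) length is what lets map grow the dataset
    return {
        "sentence": [sentence] + new_sentences,
        "label": [label] * (1 + len(new_sentences)),
        "idx": [idx] + [200000 + idx + i for i in range(len(new_sentences))],
    }

sst2 = load_dataset("glue", "sst2")
augmented_train = sst2["train"].map(augment_batch, batched=True, batch_size=1)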
After the data augmentation, the dataset acts weird; for example:
synonym_aug_datasets[0.2]["train"].filter(lambda x: x['idx'] > 200000)["idx"][:10]
synonym_aug_datasets is a Python dictionary with several augmented datasets in it. I’m simply filtering for all instances with idx larger than 200000 and then looking at the idx of the first 10 instances; all of them should have idx larger than 200000, right? Every time I run this code I get a different result, sometimes with idx values smaller than 200000, and sometimes it won’t even run:
output 1:
[55154, 55917, 200628, 409, 6218, 33825, 201639, 2063, 49115, 2959]
output 2:
ValueError: Keys mismatch: between {'indices': Value(dtype='uint64', id=None)} and {'idx': Value(dtype='int32', id=None), 'label': Value(dtype='int64', id=None), 'sentence': Value(dtype='string', id=None)}
output 3:
[200205, 200168, 200888, 200157, 201597, 201784, 200899, 200466, 201086, 200435]
How is that even possible?
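For what it’s worth, this is roughly the sanity check I keep rerunning; the disable_caching() call is only my guess at a way to force everything to be recomputed instead of read from a possibly stale cache, not something I know is the right fix:

from datasets import disable_caching

# guess: rule the cache out by forcing map/filter to recompute everything
disable_caching()

subset = synonym_aug_datasets[0.2]["train"].filter(lambda x: x["idx"] > 200000)
print(subset["idx"][:10])  # every value here should be > 200000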