Datasets not behaving as expected after random data augmentation with map

I’m using nlpaug to augment a split of the sst2 dataset. As instructed in the documentation, I’m using map with batched=True for this purpose. The function I pass to map takes one instance (batch_size=1) and generates several instances. The important thing here is that this function is not a pure function: both the sentences it generates and the number of instances it returns are random. Each time, I get a warning about a problem with caching and the fingerprint, which I assumed is caused by the random nature of my function.

After the data augmentation, the dataset acts weird; for example:

synonym_aug_datasets[0.2]["train"].filter(lambda x: x['idx'] > 200000)["idx"][:10]

synonym_aug_datasets is a Python dictionary holding several augmented datasets. I’m simply filtering for instances with idx larger than 200000 and then looking at the idx of the first 10 instances. All of them should have idx larger than 200000, right? Every time I run this code, I get a different result, sometimes even with idx smaller than 200000, and sometimes it won’t run at all.

output 1:

[55154, 55917, 200628, 409, 6218, 33825, 201639, 2063, 49115, 2959]

output 2:

ValueError: Keys mismatch: between {'indices': Value(dtype='uint64', id=None)} and {'idx': Value(dtype='int32', id=None), 'label': Value(dtype='int64', id=None), 'sentence': Value(dtype='string', id=None)}

output 3:

[200205,
 200168,
 200888,
 200157,
 201597,
 201784,
 200899,
 200466,
 201086,
 200435]

How is that even possible?

Could you show the function that you’re passing to map? It’s pretty difficult to figure out what is going wrong otherwise.

I guess that the return value of that function might be the culprit. The logic behind map is that it passes the current batch (which is basically a dict-like object) to the function, takes its output (which should be a dictionary whose keys are feature names), and adds or replaces (if names match) key-value pairs in the given batch object. So the question is: how do you implement the case where the function returns multiple augmented sentences? Does your sentence feature become a sequence of strings? Or do you do something else?

So to reiterate, the way you’re returning multiple augmented sentences for a single input sentence might be the cause.
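For what it's worth, a batched function can return a different number of rows than it received, as long as every column in the returned dictionary has the same length. Here is a minimal sketch of that shape, written in plain Python so it runs without the library. The name `expand_batch` and the `_aug` suffix are made up for illustration and are not your augmenter:

```python
def expand_batch(batch, copies=2):
    """Return the batch plus `copies` extra rows per input row.

    `batch` is a dict of equal-length lists, which is the shape a batched
    map function receives. The "augmentation" is a hypothetical placeholder
    that just appends a suffix to the sentence.
    """
    out = {"sentence": [], "label": []}
    for sentence, label in zip(batch["sentence"], batch["label"]):
        # Keep the original row.
        out["sentence"].append(sentence)
        out["label"].append(label)
        # Add the extra rows; every column grows by the same amount.
        for i in range(copies):
            out["sentence"].append(f"{sentence} _aug{i}")
            out["label"].append(label)
    return out


batch = {"sentence": ["a good film"], "label": [1]}
expanded = expand_batch(batch)
print(expanded)
```

With the real library this would be passed as dataset.map(expand_batch, batched=True, batch_size=1). If the dataset has other columns, they have to be expanded in the same way or dropped with remove_columns, otherwise the column lengths no longer match. Building a fresh dictionary instead of extending the incoming lists in place also removes any doubt about whether the input batch is safe to mutate.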

In the documentation, I read that I can use map with batched=True to change the size of a dataset, returning fewer or more instances; unfortunately, I cannot find that passage in the documentation now.

It’s a really complicated function, which is why I didn’t post it, but here it is:

def _aug_map(self, example, aug_function, idx_generator):
    # Only minority-class examples are augmented; others pass through unchanged.
    if example["label"][0] != self.minority_class:
        return example
    self.unique_aug = prob_round(self.num_each_sentence_aug)
    # Nothing to add if the rounded count is zero.
    if self.unique_aug == 0:
        return example
    sentence = example["sentence"][0]
    augmented_list = pydash.flatten_deep([aug_function.augment(sentence, n=self.unique_aug)])
    label_list = [self.minority_class for _ in range(len(augmented_list))]

    idx_list = [next(idx_generator) + i for i in range(len(augmented_list))]

    # Extend every column by the same number of rows.
    example["sentence"].extend(augmented_list)
    example["label"].extend(label_list)
    example["idx"].extend(idx_list)
    return example

With lambda, I create another function that takes only example and calls this function with other required arguments, and then I pass that function to map.
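A side note on binding the extra arguments, sketched under assumptions: functools.partial does the same job as the lambda, and either way the bound generator becomes part of the callable that gets passed to map. A generator cannot be pickled with the standard pickle module, which may be related to the fingerprint warning; that link is a guess and is not confirmed anywhere in this thread. `aug_map` and `counter` below are hypothetical stand-ins for _aug_map and idx_generator.

```python
import pickle
from functools import partial


def counter(start):
    """Hypothetical stand-in for idx_generator: yields start, start + 1, ..."""
    while True:
        yield start
        start += 1


def aug_map(example, idx_generator):
    """Hypothetical stand-in for _aug_map: assigns a fresh id to each row."""
    return {"idx": [next(idx_generator) for _ in example["idx"]]}


gen = counter(200000)
# Bind the extra argument so the resulting callable takes only `example`.
bound = partial(aug_map, idx_generator=gen)
print(bound({"idx": [0, 1]}))

# The bound generator is stateful, and the standard pickle module rejects it.
try:
    pickle.dumps(gen)
    picklable = True
except TypeError:
    picklable = False
print("generator picklable:", picklable)
```

Because the generator keeps its state between calls, calling the bound function twice on the same batch gives different ids, which is another way in which the function is not pure.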

Hi!
First, the ValueError: Keys mismatch that you got in output 2 comes from a cache issue that was recently fixed on the master branch of datasets; see "Backwards compatibility broken for cached datasets that use `.filter()`", Issue #2943 in huggingface/datasets on GitHub.

Then, regarding the other outputs, can you check in the logs whether it reloads previously computed results from the cache?

Let me tell you my real issue. With the augment function that I shared in my previous post, I’m creating different versions of augmented datasets, but I get the exact same result each time I train my model on them.
The datasets library is behaving very strangely. Before I train my model on the augmented datasets, I check them and each one is different, since each was augmented with different arguments and methods. But when I check them again after training, they are exactly the same.

When I train my model on them, the log shows that each time, before training starts, the datasets library loads the exact same cache even though the datasets are not actually the same.

Generating different versions of augmented datasets:

synonym_aug_datasets = {}

for aug_p in np.linspace(0.1, 0.5, 5):
    balanced_train = dataset_aug.synonym_augment(aug_p=aug_p)
    balanced_datasets = DatasetDict({
        "train": balanced_train.shuffle(SEED).flatten_indices(),
        "validation": imbalanced_datasets["validation"].flatten_indices(),
        "test": imbalanced_datasets["test"].flatten_indices(),
    })
    synonym_aug_datasets[aug_p] = balanced_datasets

Training on these augmented datasets:

synonym_aug_result = {}

for aug_p, balanced_datasets in synonym_aug_datasets.items():
    preds = custom_train(balanced_datasets)
    synonym_aug_result[aug_p] = preds

synonym_aug_result contains the same preds for each aug_p.

1st time:
Not loading from the cache, processing

2nd time loading from the cache:

Loading cached processed dataset at /tmp/tmpwwka_321/cache-5a9967131defac9d.arrow
Loading cached processed dataset at /tmp/tmpwwka_321/cache-b9a54799246ef5f7.arrow
Loading cached processed dataset at /tmp/tmpwwka_321/cache-7f6e5f5590fb6dfa.arrow

3rd time loading from the cache:

Loading cached processed dataset at /tmp/tmpwwka_321/cache-5a9967131defac9d.arrow
Loading cached processed dataset at /tmp/tmpwwka_321/cache-b9a54799246ef5f7.arrow
Loading cached processed dataset at /tmp/tmpwwka_321/cache-7f6e5f5590fb6dfa.arrow

My custom_train function is adapted from the sample in the course tutorial:

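import numpy as np
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
    set_seed,
)


def custom_train(balanced_datasets, checkpoint=CHECKPOINT, seed=SEED, saving_folder=SAVING_FOLDER):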
    set_seed(seed)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    
    def tokenize_function(example):
        return tokenizer(example["sentence"], truncation=True)

    tokenized_datasets = balanced_datasets.map(tokenize_function, batched=True)
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    saving_folder = saving_folder + "_balanced"  # use the argument rather than the global
    training_args = TrainingArguments(
        saving_folder,
    )

    trainer = Trainer(
        model,
        training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["validation"],
        data_collator=data_collator,
        tokenizer=tokenizer,
    )
    
    trainer.train()
    
    predictions = trainer.predict(tokenized_datasets["test"])
    preds = np.argmax(predictions.predictions, axis=-1)
    
    return preds

If I understand correctly, you are showing the logs of the training, which you ran three times on different datasets (first run aug_p=0.1, second run aug_p=0.2, third run aug_p=0.3)?
The 2nd and 3rd times, it’s loading the tokenized data from the cache. That means it considers the dataset you passed to be the same as before, even though aug_p changed.
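To make that concrete, here is a toy model of the caching rule. It is not the library's actual code: it only mimics the idea that the cache file name is derived from the dataset's current fingerprint combined with a hash of the transform. `toy_cache_name`, the hashing scheme, and the fingerprint strings are all assumptions chosen for illustration.

```python
import hashlib


def toy_cache_name(dataset_fingerprint, transform_name):
    """Toy stand-in for how a cache file name could be derived.

    Real fingerprinting hashes the function itself and its arguments;
    here a plain string represents the transform.
    """
    digest = hashlib.sha256(
        f"{dataset_fingerprint}|{transform_name}".encode("utf-8")
    ).hexdigest()[:16]
    return f"cache-{digest}.arrow"


# Same input fingerprint and same transform: the cached file is reused.
first = toy_cache_name("fp-run-1", "tokenize_function")
again = toy_cache_name("fp-run-1", "tokenize_function")

# A different input fingerprint leads to a different cache file.
other = toy_cache_name("fp-run-2", "tokenize_function")

print(first == again, first == other)
```

Under this model, seeing the same cache files loaded on every run would suggest that the input fingerprints were identical at the point where map was called.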

An easy way to debug this and see if this is the case would be to do

for aug_p, balanced_datasets in synonym_aug_datasets.items():
    print(balanced_datasets["train"]._fingerprint)

If the fingerprints are the same, then the library treats the datasets as the same.
Could you check the fingerprints, please? That should help with debugging.

However, your function is random, and you said that you got a warning about the cache. It must have notified you that it used random fingerprints because of the complex nature of your function.

Can you confirm this is what the warning message said?

Exactly.

Of course. Here they are:

cec3c64992040ae1
dced09bea689051f
7711c97e1932d34d
d86609c29dc217a0
f1b80214a0a91cfc

But after training, I get the exact same result. It loads the exact same cache each time, just as I told you in my previous post.

I saw this warning before, and yes, it said it couldn’t hash my function, but I don’t see it anymore. There is no warning when I augment my dataset now.

I found a workaround for now. Just before tokenizing, I convert my datasets to pandas DataFrames and then back to datasets. After this round trip, the datasets library no longer loads the same cache each time and recognizes that my datasets are different.

new_datasets = DatasetDict({
    "train": Dataset.from_pandas(old_datasets["train"].to_pandas()),
    "validation": Dataset.from_pandas(old_datasets["validation"].to_pandas()),
    "test": Dataset.from_pandas(old_datasets["test"].to_pandas()),
})
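
Another option that might avoid the round trip, assuming the stale cache lookup is the only problem: map accepts load_from_cache_file=False, which asks it to recompute instead of reusing a cached result. I have not verified this against the case above. The helper name `map_without_cache` is made up, and the `FakeDatasets` stub exists only so the sketch runs without the library installed.

```python
def map_without_cache(datasets_obj, function):
    """Apply `function` with map, asking it not to reuse cached results."""
    return datasets_obj.map(function, batched=True, load_from_cache_file=False)


class FakeDatasets:
    """Stub standing in for a DatasetDict; it only records the map call."""

    def __init__(self):
        self.calls = []

    def map(self, function, **kwargs):
        self.calls.append(kwargs)
        return self


fake = FakeDatasets()
result = map_without_cache(fake, lambda batch: batch)
print(fake.calls)
```

In custom_train, this would take the place of the balanced_datasets.map(tokenize_function, batched=True) call.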