Let me tell you my real issue. With the augment function I shared in my previous post, I'm creating several differently augmented versions of a dataset, but I get the exact same result every time I train my model on one of them.
The datasets library is acting really weird. Before training, I check each dataset and they are all different, since each was augmented with different arguments and methods; but after training, when I check the results, they are exactly the same.
In the training log, each time before training starts, the datasets library loads the exact same cache files, even though the datasets are not actually the same.
Generating different versions of augmented datasets:
synonym_aug_datasets = {}
for aug_p in np.linspace(0.1, 0.5, 5):
    balanced_train = dataset_aug.synonym_augment(aug_p=aug_p)
    balanced_datasets = DatasetDict({
        "train": balanced_train.shuffle(SEED).flatten_indices(),
        "validation": imbalanced_datasets["validation"].flatten_indices(),
        "test": imbalanced_datasets["test"].flatten_indices(),
    })
    synonym_aug_datasets[aug_p] = balanced_datasets
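Right after building these, I sanity-check that the train splits really do differ, e.g. by printing the first row of each (a minimal sketch; "sentence" is the text column my datasets use):

# Sketch: spot-check that each augmented train split has different content
for aug_p, ds in synonym_aug_datasets.items():
    print(aug_p, ds["train"][0]["sentence"])

At this point the printed sentences differ for every aug_p, which is what I mean by "each dataset is different" above.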
Training on these augmented datasets:
synonym_aug_result = {}
for aug_p, balanced_datasets in synonym_aug_datasets.items():
    preds = custom_train(balanced_datasets)
    synonym_aug_result[aug_p] = preds
synonym_aug_result ends up containing the exact same preds for every aug_p.
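By "the same preds" I mean element-wise identical arrays; this is roughly how I compare them (sketch):

# Sketch: confirm every aug_p produced an identical prediction array
all_preds = list(synonym_aug_result.values())
print(all(np.array_equal(all_preds[0], p) for p in all_preds[1:]))  # prints True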
1st time:
Not loading from the cache; the datasets are processed from scratch.
2nd time, loading from the cache:
Loading cached processed dataset at /tmp/tmpwwka_321/cache-5a9967131defac9d.arrow
Loading cached processed dataset at /tmp/tmpwwka_321/cache-b9a54799246ef5f7.arrow
Loading cached processed dataset at /tmp/tmpwwka_321/cache-7f6e5f5590fb6dfa.arrow
3rd time, loading from the cache:
Loading cached processed dataset at /tmp/tmpwwka_321/cache-5a9967131defac9d.arrow
Loading cached processed dataset at /tmp/tmpwwka_321/cache-b9a54799246ef5f7.arrow
Loading cached processed dataset at /tmp/tmpwwka_321/cache-7f6e5f5590fb6dfa.arrow
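For reference, I'm aware the cache can be bypassed; what I don't understand is why it hits at all when the inputs differ. A sketch of the two workarounds I know of (both are documented datasets APIs, unless I'm mistaken):

import datasets

# Option 1: turn caching off for the whole session
datasets.disable_caching()

# Option 2: force a single map() call (like the one in custom_train below) to re-process
tokenized_datasets = balanced_datasets.map(tokenize_function, batched=True, load_from_cache_file=False)

Even if that works around the symptom, it wouldn't explain why three differently augmented datasets resolve to the same cache-*.arrow files.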
My custom_train function is basically adapted from the example in the tutorial course:
def custom_train(balanced_datasets, checkpoint=CHECKPOINT, seed=SEED, saving_folder=SAVING_FOLDER):
    set_seed(seed)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    def tokenize_function(example):
        return tokenizer(example["sentence"], truncation=True)

    # This map() call is where the "Loading cached processed dataset" messages come from
    tokenized_datasets = balanced_datasets.map(tokenize_function, batched=True)
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    # Output directory for checkpoints, derived from the saving_folder argument
    saving_folder = saving_folder + "_balanced"
    training_args = TrainingArguments(
        saving_folder,
    )
    trainer = Trainer(
        model,
        training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["validation"],
        data_collator=data_collator,
        tokenizer=tokenizer,
    )
    trainer.train()

    # Evaluate on the test split and return hard class predictions
    predictions = trainer.predict(tokenized_datasets["test"])
    preds = np.argmax(predictions.predictions, axis=-1)
    return preds
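For completeness, these are the imports the snippets above assume (CHECKPOINT, SEED, and SAVING_FOLDER are constants defined earlier in my notebook):

import numpy as np
from datasets import DatasetDict
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
    set_seed,
)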