Let me tell you my real issue. With the augment function I shared in my previous post, I'm creating several differently augmented versions of a dataset, but I get the exact same result every time I train my model on one of them.
The datasets library is acting really weird. Before training, I check each dataset and they are all different, since each was augmented with different arguments and methods; but after training, when I check the results, they are exactly the same.
In the training log, each time before training starts, the datasets library loads the exact same cache files, even though the datasets are not actually the same.
Generating different versions of augmented datasets:
synonym_aug_datasets = {}
for aug_p in np.linspace(0.1, 0.5, 5):
    balanced_train = dataset_aug.synonym_augment(aug_p=aug_p)
    balanced_datasets = DatasetDict({
        "train": balanced_train.shuffle(SEED).flatten_indices(),
        "validation": imbalanced_datasets["validation"].flatten_indices(),
        "test": imbalanced_datasets["test"].flatten_indices(),
    })
    synonym_aug_datasets[aug_p] = balanced_datasets
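Right after building these, I sanity-check that the train splits really do differ, e.g. by printing the first row of each (a minimal sketch; "sentence" is the text column my datasets use):

# Sketch: spot-check that each augmented train split has different content
for aug_p, ds in synonym_aug_datasets.items():
    print(aug_p, ds["train"][0]["sentence"])

At this point the printed sentences differ for every aug_p, which is what I mean by "each dataset is different" above.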
Training on these augmented datasets:
synonym_aug_result = {}
for aug_p, balanced_datasets in synonym_aug_datasets.items():
    preds = custom_train(balanced_datasets)
    synonym_aug_result[aug_p] = preds
synonym_aug_result ends up containing the exact same preds for every aug_p.
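By "the same preds" I mean element-wise identical arrays; this is roughly how I compare them (sketch):

# Sketch: confirm every aug_p produced an identical prediction array
all_preds = list(synonym_aug_result.values())
print(all(np.array_equal(all_preds[0], p) for p in all_preds[1:]))  # prints True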
1st time:
Not loading from the cache; the datasets are processed from scratch.
2nd time, loading from the cache:
Loading cached processed dataset at /tmp/tmpwwka_321/cache-5a9967131defac9d.arrow
Loading cached processed dataset at /tmp/tmpwwka_321/cache-b9a54799246ef5f7.arrow
Loading cached processed dataset at /tmp/tmpwwka_321/cache-7f6e5f5590fb6dfa.arrow
3rd time, loading from the cache:
Loading cached processed dataset at /tmp/tmpwwka_321/cache-5a9967131defac9d.arrow
Loading cached processed dataset at /tmp/tmpwwka_321/cache-b9a54799246ef5f7.arrow
Loading cached processed dataset at /tmp/tmpwwka_321/cache-7f6e5f5590fb6dfa.arrow
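For reference, I'm aware the cache can be bypassed; what I don't understand is why it hits at all when the inputs differ. A sketch of the two workarounds I know of (both are documented datasets APIs, unless I'm mistaken):

import datasets

# Option 1: turn caching off for the whole session
datasets.disable_caching()

# Option 2: force a single map() call (like the one in custom_train below) to re-process
tokenized_datasets = balanced_datasets.map(tokenize_function, batched=True, load_from_cache_file=False)

Even if that works around the symptom, it wouldn't explain why three differently augmented datasets resolve to the same cache-*.arrow files.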
My custom_train function is basically adapted from the example in the tutorial course:
def custom_train(balanced_datasets, checkpoint=CHECKPOINT, seed=SEED, saving_folder=SAVING_FOLDER):
    set_seed(seed)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    def tokenize_function(example):
        return tokenizer(example["sentence"], truncation=True)

    # This map() call is where the "Loading cached processed dataset" messages come from
    tokenized_datasets = balanced_datasets.map(tokenize_function, batched=True)
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    # Output directory for checkpoints, derived from the saving_folder argument
    saving_folder = saving_folder + "_balanced"
    training_args = TrainingArguments(
        saving_folder,
    )
    trainer = Trainer(
        model,
        training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["validation"],
        data_collator=data_collator,
        tokenizer=tokenizer,
    )
    trainer.train()

    # Evaluate on the test split and return hard class predictions
    predictions = trainer.predict(tokenized_datasets["test"])
    preds = np.argmax(predictions.predictions, axis=-1)
    return preds
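For completeness, these are the imports the snippets above assume (CHECKPOINT, SEED, and SAVING_FOLDER are constants defined earlier in my notebook):

import numpy as np
from datasets import DatasetDict
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
    set_seed,
)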