K-fold cross-validation leaking into subsequent folds

I am attempting to use K-fold cross-validation to train my model, but the F1 score keeps increasing from fold to fold, which tells me there is some serious data leakage.
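
For context, train_list and val_list each hold one index array per fold; they are built with scikit-learn's KFold along these lines (a sketch of the assumed split code, which lives earlier in my script):

import numpy as np
from sklearn.model_selection import KFold

# Assumed fold construction: 5 shuffled, disjoint validation splits over the training set
n_examples = len(tokenized_Hraf["train"])
kf = KFold(n_splits=5, shuffle=True, random_state=42)

train_list, val_list = [], []
for train_idxs, val_idxs in kf.split(np.arange(n_examples)):
    train_list.append(train_idxs)
    val_list.append(val_idxs)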

Here is my training code:

import os
from transformers import Trainer, TrainingArguments

for fold, (train_idxs, val_idxs) in enumerate(zip(train_list, val_list), start=1):  # K-fold loop
        
        output_dir = f"{model_name}/output_dir_{weight_decay}_fold_{fold}"

        resume_bool = False
        
        # Skip folds that have already completed
        if os.path.exists(output_dir):
            if os.path.exists(f"{output_dir}/finished.txt"):
                print('\033[93m' + f"Skipping {output_dir} as it is marked as finished" + '\033[0m')
                continue
            else:
                print('\033[93m' + f"Starting from the last checkpoint in {output_dir}" + '\033[0m')
                resume_bool = True  # resume from the last checkpoint when an output folder exists but training did not finish


        print(f"------Fold {fold}/{len(train_list)}--------\n")
        train_ds = tokenized_Hraf["train"].select(train_idxs)
        val_ds = tokenized_Hraf["train"].select(val_idxs)


        training_args = TrainingArguments(
            output_dir=output_dir,
            learning_rate=learning_rate,
            per_device_train_batch_size=8,
            per_device_eval_batch_size=8,
            num_train_epochs=3,
            weight_decay=weight_decay,
            evaluation_strategy="epoch",
            save_strategy="epoch",
            metric_for_best_model='f1',
            push_to_hub=False,
            logging_dir=f"{model_name}/logs_{weight_decay}_fold_{fold}",
            logging_steps=100,
        )


        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=train_ds,
            eval_dataset=val_ds,
            tokenizer=tokenizer,
            data_collator=data_collator,
            # callbacks=[best_checkpoint_callback],
            compute_metrics=compute_metrics,
        )
        try:
            trainer.train(resume_from_checkpoint=resume_bool)
        except Exception:
            print('\033[91m' + "A crash occurred, restarting fold from checkpoint" + '\033[0m')
            trainer.train(resume_from_checkpoint=True)  # same call as above, but restarting from the last checkpoint often helps, so try it once more
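
To rule out literal index overlap between the splits, a quick sanity check over the fold lists looks like this (a sketch; it only assumes train_list and val_list hold integer index arrays):

# Sanity check: no fold's validation indices appear in its own training indices,
# and no validation index is reused across folds.
all_val = set()
for fold, (train_idxs, val_idxs) in enumerate(zip(train_list, val_list), start=1):
    train_set, val_set = set(map(int, train_idxs)), set(map(int, val_idxs))
    assert not train_set & val_set, f"Fold {fold}: train/val indices overlap"
    assert not all_val & val_set, f"Fold {fold}: validation indices reused from an earlier fold"
    all_val |= val_set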