TensorBoard runs are not saved after the epoch ends

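from sklearn.model_selection import KFold
from transformers import (
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

# model_name, model_part, tokenizer, train_dataset and eval_predict are defined earlier (not shown).
k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=42)
test_results = []  # per-fold evaluation metrics
fold_number = 1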
for fold, (train_indices, val_indices) in enumerate(kf.split(train_dataset)):
    print(f"Training on fold {fold + 1}/{k}...")

    train_subset = train_dataset.select(train_indices)
    val_subset = train_dataset.select(val_indices)
    
    model = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True, num_labels=157)
    
    training_args = TrainingArguments(
        output_dir=f'./checkpoints/DNABert2-117M/GenusTax/{k}-fold/{model_part}_fold_{fold_number}',
        num_train_epochs=30,
        per_device_train_batch_size=24,
        per_device_eval_batch_size=64,
        warmup_steps=500,  # when set, warmup_steps takes precedence over warmup_ratio
        warmup_ratio=0.5,
        learning_rate=5e-5,
        weight_decay=0.0001,
        logging_steps=10,
        eval_strategy='epoch',  # named evaluation_strategy before transformers 4.41
        save_strategy='epoch',
        save_total_limit=2,
        metric_for_best_model='f1',
        load_best_model_at_end=True,
        greater_is_better=True,
    )
    
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_subset,
        eval_dataset=val_subset,
        data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
        compute_metrics=eval_predict,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=1)]  
    )
    
    trainer.train()
    
    metrics = trainer.evaluate()
    print(metrics)
    test_results.append(metrics)
    
    fold_number += 1

Only the checkpoints were saved. In a previous task, the same code also saved the runs folder.
Installed versions:
peft==0.12.0
transformers==4.44.0

Eventually, I discovered the truth: I had not installed the TensorBoard package in my new conda environment… :see_no_evil:
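
For anyone hitting the same thing: as far as I can tell, when `report_to` is left unset in this transformers version, the Trainer only reports to the logging integrations that are installed, so a missing TensorBoard package means no runs folder and no error. Below is a minimal sketch of a check you could run before training. The helper names `tensorboard_available` and `choose_report_to` are made up for illustration and are not part of transformers.

```python
import importlib.util


def tensorboard_available() -> bool:
    """Return True if a TensorBoard writer package is importable."""
    # Either `tensorboard` or `tensorboardX` can provide a SummaryWriter.
    return any(
        importlib.util.find_spec(name) is not None
        for name in ("tensorboard", "tensorboardX")
    )


def choose_report_to() -> str:
    """Hypothetical helper: pick a value for TrainingArguments(report_to=...)."""
    if tensorboard_available():
        return "tensorboard"
    print(
        "TensorBoard is not installed, so no runs folder will be written. "
        "Install it with `pip install tensorboard`."
    )
    return "none"


report_to = choose_report_to()
print(report_to)
```

You can then pass `report_to=report_to` to `TrainingArguments`. Alternatively, setting `report_to="tensorboard"` directly should make the missing package fail loudly at Trainer creation instead of being skipped silently.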


Oops… It happens. :laughing:
