CUDA Deadlock while training DETR

I was following the object detection guide to train DAB-DETR on my custom dataset. I have checked my collate_fn function and it works as expected. I also found no issues with the dataset or the input format. The Trainer and TrainingArguments objects are initialized without errors. However, as soon as the train method is called, I receive:

/usr/local/lib/python3.12/dist-packages/notebook/notebookapp.py:191: SyntaxWarning: invalid escape sequence '\/'
  | |_| | '_ \/ _` / _` |  _/ -_)

After this warning, nothing happens and no GPU memory gets allocated. The cell just appears to be running without doing anything. I am on Colab. When I try to stop the cell, it does not work, and even restarting the runtime gets stuck, so the only way out is to disconnect from the runtime. Has anybody had a similar experience, or does anyone know a solution?

The training setup is as follows:

training_args = TrainingArguments(
    output_dir=checkpoint_path_huggingface,
    num_train_epochs=30,
    fp16=False,
    per_device_train_batch_size=BATCH_SIZE,
    dataloader_num_workers=0,
    dataloader_pin_memory=False,
    disable_tqdm=False,
    report_to=None,
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    weight_decay=1e-4,
    max_grad_norm=0.1,
    metric_for_best_model="eval_map",
    greater_is_better=True,
    load_best_model_at_end=True,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    processing_class=processor,
    data_collator=collate_fn,
    compute_metrics=eval_compute_metrics_fn,
)

That warning is harmless and can be safely ignored. The hang likely has a different cause. For example, if your custom dataset is stored on Google Drive, training can appear to stall because reading files from Drive is very slow.
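If slow Drive reads are indeed the cause, one possible workaround is to copy the dataset onto the runtime's local disk once and build the datasets from that copy. Below is a minimal sketch of that idea. The helper name copy_dataset_to_local and both paths in the usage comment are hypothetical placeholders, not something from the original post.

```python
import shutil
from pathlib import Path


def copy_dataset_to_local(src_dir: str, dst_dir: str) -> Path:
    """Copy a dataset directory to local storage and return the new path.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    src = Path(src_dir)
    dst = Path(dst_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"Source dataset directory not found: {src}")
    # dirs_exist_ok=True lets the cell be re-run without failing
    # when the destination directory already exists.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst


# Hypothetical usage in Colab (both paths are placeholders, adjust to your setup):
# local_root = copy_dataset_to_local(
#     "/content/drive/MyDrive/my_dataset", "/content/my_dataset"
# )
# Then build train_dataset and val_dataset from local_root instead of the Drive path.
```

The copy itself can still take a while for a large dataset, but it is paid once up front rather than on every sample read during training.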

Thank you very much, the issue got fixed.

