I am building a Longformer-based classification model similar to this. If I want to tune my model, which parameters do I need to consider, and are there any recommendations for their values?
Currently I am considering the following parameters and candidate values (a sketch of how I plan to wire them up follows the list):
attention_window = 256, 512, or 1024
optim = "adamw_torch", "adamw_apex_fused", or "adafactor"
weight_decay = 0, 0.01, or 0.02
learning_rate = 5e-5 or 10e-5
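For reference, this is roughly how I plan to sweep these candidates. It is only a minimal grid-search sketch: the checkpoint name, `num_labels=2`, and the pre-tokenized datasets are placeholders I chose for illustration, not fixed parts of my setup. Note that `attention_window` lives on the model config, while the other three are `TrainingArguments` fields.

```python
import itertools

from transformers import (
    LongformerConfig,
    LongformerForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Candidate values from the list above; dict order fixes the unpack order below.
SEARCH_SPACE = {
    "attention_window": [256, 512, 1024],
    "optim": ["adamw_torch", "adamw_apex_fused", "adafactor"],
    "weight_decay": [0.0, 0.01, 0.02],
    "learning_rate": [5e-5, 10e-5],
}


def run_grid_search(train_dataset, eval_dataset):
    """Train one model per combination and return per-run eval metrics.

    Assumes both datasets are already tokenized for Longformer.
    """
    results = {}
    for attention_window, optim, weight_decay, learning_rate in itertools.product(
        *SEARCH_SPACE.values()
    ):
        # attention_window is a model-config parameter, so the model is
        # rebuilt per value; the other three map onto TrainingArguments.
        config = LongformerConfig.from_pretrained(
            "allenai/longformer-base-4096",
            attention_window=attention_window,
            num_labels=2,  # assumption: binary classification
        )
        model = LongformerForSequenceClassification.from_pretrained(
            "allenai/longformer-base-4096", config=config
        )
        run_name = f"aw{attention_window}-{optim}-wd{weight_decay}-lr{learning_rate}"
        args = TrainingArguments(
            output_dir=f"runs/{run_name}",
            optim=optim,
            weight_decay=weight_decay,
            learning_rate=learning_rate,
        )
        trainer = Trainer(
            model=model,
            args=args,
            train_dataset=train_dataset,
            eval_dataset=eval_dataset,
        )
        trainer.train()
        results[run_name] = trainer.evaluate()
    return results
```

One concern with this approach: the full grid is 3 × 3 × 3 × 2 = 54 complete training runs, so if that is too expensive I could switch to `Trainer.hyperparameter_search` with a backend like Optuna for the `TrainingArguments` fields instead of exhaustive search.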
What other parameters should I tune? Do any of the following need tuning (a sketch of where they fit is below)?
num_train_epochs, per_device_train_batch_size, gradient_accumulation_steps, per_device_eval_batch_size, warmup_steps, dataloader_num_workers, lr_scheduler_type
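As I understand it, all of these are plain `TrainingArguments` fields, so they slot into the same setup as above. The values here are just common starting points I am assuming, not recommendations from any documentation:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="runs/baseline",
    num_train_epochs=3,
    per_device_train_batch_size=2,   # Longformer inputs are long, so small per-device batches
    gradient_accumulation_steps=8,   # effective train batch size per device = 2 * 8
    per_device_eval_batch_size=4,    # eval can often use a larger batch (no gradients)
    warmup_steps=500,
    lr_scheduler_type="linear",
    dataloader_num_workers=4,        # throughput knob, not a model hyperparameter
)
```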
Please let me know if there is any documentation about parameter tuning.