I am building a Longformer-based classification model similar to this. If I want to tune the model, which parameters should I consider, and are there any recommended values for them?
Currently I am considering the parameters and values below (a sketch of how I am setting them follows the list):
- `attention_window`: 256, 512, or 1024
- `optim`: adamw_torch, adamw_apex_fused, or adafactor
- `weight_decay`: 0, 0.01, or 0.02
- `learning_rate`: 5e-5 or 10e-5
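For context, here is roughly how I am wiring these in: `attention_window` is a model config parameter, while the other three go through `TrainingArguments`. This is a minimal sketch assuming a Hugging Face `Trainer` setup; the checkpoint name, `output_dir`, and `num_labels` are placeholders, not my actual values.

```python
from transformers import (
    LongformerConfig,
    LongformerForSequenceClassification,
    TrainingArguments,
)

# Placeholder checkpoint; substitute whatever base model is being fine-tuned.
checkpoint = "allenai/longformer-base-4096"

# attention_window lives on the model config, not on TrainingArguments.
config = LongformerConfig.from_pretrained(
    checkpoint,
    attention_window=512,  # trying 256, 512, or 1024
    num_labels=2,          # placeholder; set to the actual number of classes
)
model = LongformerForSequenceClassification.from_pretrained(checkpoint, config=config)

# The remaining three candidates are TrainingArguments fields.
training_args = TrainingArguments(
    output_dir="./results",  # placeholder path
    optim="adamw_torch",     # trying adamw_torch, adamw_apex_fused, adafactor
    weight_decay=0.01,       # trying 0, 0.01, 0.02
    learning_rate=5e-5,      # trying 5e-5, 10e-5
)
```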
What other parameters should I tune? Do I need to tune any of these (see the search sketch after the list)?
- `num_train_epochs`
- `per_device_train_batch_size`
- `gradient_accumulation_steps`
- `per_device_eval_batch_size`
- `warmup_steps`
- `dataloader_num_workers`
- `lr_scheduler_type`
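For how I would actually run the sweep: I was planning to use `Trainer.hyperparameter_search` with the Optuna backend. This is a minimal sketch continuing from the snippet above; `train_dataset`, `eval_dataset`, and the epoch/warmup ranges are placeholder assumptions on my part.

```python
from transformers import Trainer

def model_init():
    # Fresh model per trial, reusing checkpoint/config from the sketch above.
    return LongformerForSequenceClassification.from_pretrained(checkpoint, config=config)

def hp_space(trial):
    # Optuna search space mirroring the candidate values listed above.
    return {
        "learning_rate": trial.suggest_categorical("learning_rate", [5e-5, 10e-5]),
        "weight_decay": trial.suggest_categorical("weight_decay", [0.0, 0.01, 0.02]),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 5),
        "warmup_steps": trial.suggest_categorical("warmup_steps", [0, 100, 500]),
    }

trainer = Trainer(
    model_init=model_init,        # required so each trial starts from scratch
    args=training_args,
    train_dataset=train_dataset,  # placeholder: tokenized training split
    eval_dataset=eval_dataset,    # placeholder: tokenized validation split
)

best_run = trainer.hyperparameter_search(
    hp_space=hp_space,
    n_trials=10,
    direction="minimize",  # minimizes eval loss by default
    backend="optuna",
)
print(best_run.hyperparameters)
```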
Please also let me know if there is any documentation on tuning these parameters.