I am currently training a summarization model in the background with nohup bash ~.
Since nohup captures every tqdm progress-bar refresh, the output file grows far too large. I am fine with the data-mapping and training logs, but there are some extremely long logs in between the training logs.
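For context, the job is launched roughly like the line below (the script name is a placeholder, not my exact command). As far as I understand, tqdm writes its bars to stderr, and nohup sends both stdout and stderr into the log file, so every bar refresh becomes another line:

nohup bash train.sh &   # placeholder script name; stdout and stderr both end up in nohup.out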
Right now I am using the Trainer from the transformers library together with wandb.
I can't identify what this progress bar is…
Here is the code snippet:
if args.do_train:
    wandb.init(
        name=f"{model_name_only}-data:{args.dataset}:{args.train_size}-{random_num}",
        project=f'{model_name_only}-{args.train_rl_size}-{random_num}',
        settings=wandb.Settings(_service_wait=3000),
    )
    print('train bart..')
    seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
    print(f'output directory: {output_dir}')
    training_args = Seq2SeqTrainingArguments(
        output_dir=args.output_model_dir,
        num_train_epochs=args.epoch,
        warmup_steps=500,
        per_device_train_batch_size=args.train_batch_size,
        per_device_eval_batch_size=args.test_batch_size,
        weight_decay=0.01,
        logging_steps=500,
        evaluation_strategy='steps',
        eval_steps=500,
        save_steps=1e6,
        predict_with_generate=True,
        remove_unused_columns=True,
        hub_model_id=output_dir.split('/')[-1],
        push_to_hub=args.push_to_hub,
        gradient_accumulation_steps=16,
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        tokenizer=tokenizer,
        data_collator=seq2seq_data_collator,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
    )
    trainer.train()
    print('done')

    if args.push_to_hub:
        trainer.save_model(output_dir)
        print(f'save model to {output_dir}')
        trainer.push_to_hub()
        print('push model to hub')

if args.do_rl:
And here is the nohup log…
Map: 93%|█████████▎| 4612/4969 [00:04<00:00, 1227.55 examples/s]
Map: 96%|█████████▌| 4778/4969 [00:04<00:00, 1317.31 examples/s]
Map: 100%|██████████| 4969/4969 [00:04<00:00, 1108.47 examples/s]
Map: 100%|██████████| 4969/4969 [00:04<00:00, 1130.69 examples/s]
wandb: Currently logged in as: baek26. Use `wandb login --relogin` to force relogin
wandb: wandb version 0.16.6 is available! To upgrade, please run:
wandb: $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.16.4
wandb: Run data is saved locally in /hdd/hdd2/baek26/Ours/MDO/wandb/run-20240417_001446-a6ortut6
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run facebook-bart-base-data:all:None-323
wandb: ⭐️ View project at https://wandb.ai/baek26/facebook-bart-base-None-323
wandb: 🚀 View run at https://wandb.ai/baek26/facebook-bart-base-None-323/runs/a6ortut6
/home/guest-bje/.local/share/virtualenvs/Ours-2zhE1riw/lib/python3.8/site-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an `accelerate.DataLoaderConfiguration` instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
warnings.warn(
Detected kernel version 4.15.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
done..
train bart..
output directory: ./checkpoints/MDO/all_323_bart-base
0%| | 0/49760 [00:00<?, ?it/s]
0%| | 1/49760 [00:06<83:29:06, 6.04s/it]
0%| | 2/49760 [00:08<56:15:16, 4.07s/it]
0%| | 3/49760 [00:11<47:42:26, 3.45s/it]
0%| | 4/49760 [00:14<43:31:26, 3.15s/it]
0%| | 5/49760 [00:16<41:31:17, 3.00s/it]
0%| | 6/49760 [00:19<40:14:46, 2.91s/it]
0%| | 7/49760 [00:22<39:29:00, 2.86s/it]
0%| | 8/49760 [00:25<38:44:46, 2.80s/it]
0%| | 9/49760 [00:27<38:36:48, 2.79s/it]
0%| | 10/49760 [00:30<38:23:16, 2.78s/it]
0%| | 10/49760 [00:30<38:23:16, 2.78s/it]
0%| | 11/49760 [00:33<38:11:39, 2.76s/it]
0%| | 12/49760 [00:35<37:44:02, 2.73s/it]
0%| | 13/49760 [00:38<37:30:46, 2.71s/it]
0%| | 14/49760 [00:41<37:18:40, 2.70s/it]
...
1%| | 491/49760 [23:04<36:14:18, 2.65s/it]
1%| | 492/49760 [23:07<36:11:57, 2.65s/it]
1%| | 493/49760 [23:09<36:15:00, 2.65s/it]
1%| | 494/49760 [23:12<36:11:42, 2.64s/it]
1%| | 495/49760 [23:14<36:08:02, 2.64s/it]
1%| | 496/49760 [23:17<36:06:57, 2.64s/it]
1%| | 497/49760 [23:20<36:14:44, 2.65s/it]
1%| | 498/49760 [23:23<36:33:00, 2.67s/it]
1%| | 499/49760 [23:25<36:58:12, 2.70s/it]
1%| | 500/49760 [23:28<37:10:19, 2.72s/it]
1%| | 500/49760 [23:28<37:10:19, 2.72s/it]/home/guest-bje/.local/share/virtualenvs/Ours-2zhE1riw/lib/python3.8/site-packages/transformers/generation/utils.py:1178: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
warnings.warn(
{'loss': 9.467, 'grad_norm': 24.954456329345703, 'learning_rate': 1.0000000000000002e-06, 'epoch': 0.0}
{'loss': 8.9265, 'grad_norm': 16.096410751342773, 'learning_rate': 2.0000000000000003e-06, 'epoch': 0.0}
{'loss': 8.3355, 'grad_norm': 11.878803253173828, 'learning_rate': 3e-06, 'epoch': 0.01}
{'loss': 7.7632, 'grad_norm': 10.747937202453613, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.01}
{'loss': 7.2561, 'grad_norm': 9.314805030822754, 'learning_rate': 5e-06, 'epoch': 0.01}
{'loss': 6.9627, 'grad_norm': 12.294201850891113, 'learning_rate': 6e-06, 'epoch': 0.01}
{'loss': 6.4995, 'grad_norm': 16.7220401763916, 'learning_rate': 7.000000000000001e-06, 'epoch': 0.01}
{'loss': 5.8852, 'grad_norm': 22.519121170043945, 'learning_rate': 8.000000000000001e-06, 'epoch': 0.02}
{'loss': 5.0685, 'grad_norm': 23.565885543823242, 'learning_rate': 9e-06, 'epoch': 0.02}
{'loss': 4.5524, 'grad_norm': 22.735116958618164, 'learning_rate': 1e-05, 'epoch': 0.02}
{'loss': 4.1538, 'grad_norm': 22.052845001220703, 'learning_rate': 1.1000000000000001e-05, 'epoch': 0.02}
...
{'loss': 1.1735, 'grad_norm': 1.2184255123138428, 'learning_rate': 4.8e-05, 'epoch': 0.1}
{'loss': 1.2038, 'grad_norm': 1.1376460790634155, 'learning_rate': 4.9e-05, 'epoch': 0.1}
{'loss': 1.2235, 'grad_norm': 1.3213053941726685, 'learning_rate': 5e-05, 'epoch': 0.1}
0%| | 0/3777 [00:00<?, ?it/s]e[A
0%| | 2/3777 [00:00<10:46, 5.84it/s]e[A
0%| | 3/3777 [00:00<20:53, 3.01it/s]e[A
0%| | 4/3777 [00:01<21:45, 2.89it/s]e[A
0%| | 5/3777 [00:01<28:40, 2.19it/s]e[A
0%| | 6/3777 [00:02<25:43, 2.44it/s]e[A
0%| | 7/3777 [00:02<24:17, 2.59it/s]e[A
0%| | 8/3777 [00:03<24:52, 2.53it/s]e[A
0%| | 9/3777 [00:03<28:24, 2.21it/s]e[A
0%| | 10/3777 [00:04<29:12, 2.15it/s]e[A
0%| | 11/3777 [00:04<26:26, 2.37it/s]e[A
...
99%|█████████▉| 3754/3777 [22:35<00:17, 1.31it/s]e[A
99%|█████████▉| 3755/3777 [22:35<00:13, 1.62it/s]e[A
99%|█████████▉| 3756/3777 [22:36<00:13, 1.55it/s]e[A
99%|█████████▉| 3757/3777 [22:37<00:14, 1.35it/s]e[A
99%|█████████▉| 3758/3777 [22:38<00:13, 1.41it/s]e[A
100%|█████████▉| 3759/3777 [22:38<00:10, 1.77it/s]e[A
100%|█████████▉| 3760/3777 [22:39<00:10, 1.56it/s]e[A
100%|█████████▉| 3761/3777 [22:40<00:11, 1.38it/s]e[A
100%|█████████▉| 3762/3777 [22:40<00:10, 1.49it/s]e[A
100%|█████████▉| 3763/3777 [22:40<00:07, 1.85it/s]e[A
100%|█████████▉| 3764/3777 [22:41<00:07, 1.85it/s]e[A
100%|█████████▉| 3765/3777 [22:42<00:07, 1.56it/s]e[A
100%|█████████▉| 3766/3777 [22:43<00:07, 1.44it/s]e[A
100%|█████████▉| 3767/3777 [22:43<00:05, 1.76it/s]e[A
100%|█████████▉| 3768/3777 [22:43<00:04, 1.91it/s]e[A
100%|█████████▉| 3769/3777 [22:44<00:05, 1.59it/s]e[A
100%|█████████▉| 3770/3777 [22:45<00:05, 1.38it/s]e[A
100%|█████████▉| 3771/3777 [22:45<00:03, 1.70it/s]e[A
100%|█████████▉| 3772/3777 [22:46<00:02, 2.05it/s]e[A
100%|█████████▉| 3773/3777 [22:47<00:02, 1.61it/s]e[A
100%|█████████▉| 3774/3777 [22:48<00:02, 1.31it/s]e[A
100%|█████████▉| 3775/3777 [22:48<00:01, 1.65it/s]e[A
100%|█████████▉| 3776/3777 [22:49<00:00, 1.40it/s]e[A
100%|██████████| 3777/3777 [22:50<00:00, 1.50it/s]e[A
e[A
100%|██████████| 3777/3777 [23:31<00:00, 1.50it/s]e[A
1%| | 500/49760 [47:00<37:10:19, 2.72s/it]
e[A
1%| | 501/49760 [47:04<5838:06:57, 426.67s/it]
1%| | 502/49760 [47:07<4100:50:18, 299.71s/it]
1%| | 503/49760 [47:10<2881:12:24, 210.58s/it]
1%| | 504/49760 [47:13<2027:26:41, 148.18s/it]
1%| | 505/49760 [47:15<1429:52:12, 104.51s/it]
1%| | 506/49760 [47:18<1011:33:41, 73.94s/it]
1%| | 507/49760 [47:20<718:43:33, 52.53s/it
From this example, you can see that the lines counting up to 3777 each produce an extra empty line per log entry, and they are written far too frequently… It is so annoying. What is this progress bar, and how can I fix it?
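One thing I have been considering (just a sketch based on the TrainingArguments and datasets documentation; I am not sure whether it actually suppresses the 3777 bar or only the outer training bar) is turning the progress bars off explicitly, roughly like this:

from datasets.utils.logging import disable_progress_bar
from transformers import Seq2SeqTrainingArguments

disable_progress_bar()  # should hide the datasets "Map: ..." bars

training_args = Seq2SeqTrainingArguments(
    output_dir=args.output_model_dir,   # same args object as in my snippet above
    disable_tqdm=True,                  # supposed to disable the Trainer's tqdm bars
    logging_steps=500,
    evaluation_strategy='steps',
    eval_steps=500,
)

Would that be the right approach, or is there a cleaner way to keep the loss/eval logs while dropping the bars?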