run_clm.py is very slow on GPU (it used to take seconds)

I have the feeling that, after updating to transformers 4.7.0.dev0, fine-tuning with run_clm.py now takes 5-6 hours on a training file of less than 1 MB.

Has anyone else experienced this? CUDA and cuDNN are installed, and it all worked a few weeks ago.
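A quick, generic way to confirm that the installed PyTorch build actually sees the GPU (this is not part of run_clm.py, just a sanity check):

```python
import torch

# Sanity check: is the CUDA build of PyTorch installed and can it see the GPU?
print(torch.__version__)              # PyTorch version
print(torch.cuda.is_available())      # should print True for GPU training
print(torch.version.cuda)             # CUDA version this PyTorch build was compiled for
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # GPU that will be used
```

Here is the command and the log output: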

(base) E:\gpt-2-test>python run_clm.py --model_type gpt2-medium --model_name_or_path dbmdz/german-gpt2 --train_file "chatlog.txt" --do_train --per_device_train_batch_size 1 --save_steps -1 --num_train_epochs 1 --block_size 512 --output_dir=/finetune_out
2021-05-20 15:57:44.117497: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudart64_110.dll
05/20/2021 15:57:55 - WARNING - main - Process rank: -1, device: cuda:0, n_gpu: 1distributed training: False, 16-bits training: False
05/20/2021 15:57:55 - INFO - main - Training/evaluation parameters TrainingArguments(output_dir=/finetune_out, overwrite_output_dir=False, do_train=True, do_eval=False, do_predict=False, evaluation_strategy=IntervalStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=1, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_ratio=0.0, warmup_steps=0, logging_dir=runs\May20_15-57-55_Corouscant, logging_strategy=IntervalStrategy.STEPS, logging_first_step=False, logging_steps=500, save_strategy=IntervalStrategy.STEPS, save_steps=-1, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level=O1, fp16_backend=auto, fp16_full_eval=False, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=[], dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name=/finetune_out, disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=[], deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, length_column_name=length, report_to=['tensorboard'], ddp_find_unused_parameters=None, dataloader_pin_memory=True, skip_memory_metrics=False, use_legacy_prediction_loop=False, push_to_hub=False, resume_from_checkpoint=None, _n_gpu=1, mp_parameters=)
05/20/2021 15:57:55 - WARNING - datasets.builder - Using custom data configuration default-bfb466efc3ef149a
05/20/2021 15:57:55 - WARNING - datasets.builder - Reusing dataset text (C:\Users\swilg\.cache\huggingface\datasets\text\default-bfb466efc3ef149a\0.0.0\293ecb642f9fca45b44ad1f90c8445c54b9d80b95ab3fca3cfa5e1e3d85d4a57)
[INFO|configuration_utils.py:515] 2021-05-20 15:57:55,706 >> loading configuration file dbmdz/german-gpt2\config.json
[INFO|configuration_utils.py:553] 2021-05-20 15:57:55,707 >> Model config GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 52000,
  "embd_pdrop": 0.1,
  "eos_token_id": 52000,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "resid_pdrop": 0.1,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.7.0.dev0",
  "use_cache": true,
  "vocab_size": 52000
}

[INFO|configuration_utils.py:515] 2021-05-20 15:57:55,708 >> loading configuration file dbmdz/german-gpt2\config.json
[INFO|configuration_utils.py:553] 2021-05-20 15:57:55,708 >> Model config GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 52000,
  "embd_pdrop": 0.1,
  "eos_token_id": 52000,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "resid_pdrop": 0.1,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.7.0.dev0",
  "use_cache": true,
  "vocab_size": 52000
}

[INFO|tokenization_utils_base.py:1651] 2021-05-20 15:57:55,711 >> Didn't find file dbmdz/german-gpt2\tokenizer.json. We won't load it.
[INFO|tokenization_utils_base.py:1651] 2021-05-20 15:57:55,711 >> Didn't find file dbmdz/german-gpt2\added_tokens.json. We won't load it.
[INFO|tokenization_utils_base.py:1651] 2021-05-20 15:57:55,727 >> Didn't find file dbmdz/german-gpt2\special_tokens_map.json. We won't load it.
[INFO|tokenization_utils_base.py:1715] 2021-05-20 15:57:55,749 >> loading file dbmdz/german-gpt2\vocab.json
[INFO|tokenization_utils_base.py:1715] 2021-05-20 15:57:55,749 >> loading file dbmdz/german-gpt2\merges.txt
[INFO|tokenization_utils_base.py:1715] 2021-05-20 15:57:55,771 >> loading file None
[INFO|tokenization_utils_base.py:1715] 2021-05-20 15:57:55,776 >> loading file None
[INFO|tokenization_utils_base.py:1715] 2021-05-20 15:57:55,811 >> loading file None
[INFO|tokenization_utils_base.py:1715] 2021-05-20 15:57:55,815 >> loading file dbmdz/german-gpt2\tokenizer_config.json
[WARNING|tokenization_utils_base.py:1846] 2021-05-20 15:57:55,933 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|modeling_utils.py:1153] 2021-05-20 15:57:55,940 >> loading weights file dbmdz/german-gpt2\pytorch_model.bin
[INFO|modeling_utils.py:1339] 2021-05-20 15:57:57,478 >> All model checkpoint weights were used when initializing GPT2LMHeadModel.

[INFO|modeling_utils.py:1347] 2021-05-20 15:57:57,479 >> All the weights of GPT2LMHeadModel were initialized from the model checkpoint at dbmdz/german-gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.
05/20/2021 15:57:58 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at C:\Users\swilg\.cache\huggingface\datasets\text\default-bfb466efc3ef149a\0.0.0\293ecb642f9fca45b44ad1f90c8445c54b9d80b95ab3fca3cfa5e1e3d85d4a57\cache-1e772a76312f9b12.arrow
05/20/2021 15:57:58 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at C:\Users\swilg\.cache\huggingface\datasets\text\default-bfb466efc3ef149a\0.0.0\293ecb642f9fca45b44ad1f90c8445c54b9d80b95ab3fca3cfa5e1e3d85d4a57\cache-0f5651d8588b8b3e.arrow
[INFO|trainer.py:1145] 2021-05-20 15:58:06,183 >> ***** Running training *****
[INFO|trainer.py:1146] 2021-05-20 15:58:06,199 >> Num examples = 563
[INFO|trainer.py:1147] 2021-05-20 15:58:06,215 >> Num Epochs = 1
[INFO|trainer.py:1148] 2021-05-20 15:58:06,230 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1149] 2021-05-20 15:58:06,246 >> Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|trainer.py:1150] 2021-05-20 15:58:06,262 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1151] 2021-05-20 15:58:06,278 >> Total optimization steps = 563
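
For scale: the log shows 563 optimization steps, so 5-6 hours of training works out to roughly 30-40 seconds per step, which looks more like CPU speed than GPU speed for a GPT-2-small-sized model at block size 512. Below is a minimal sketch to time a single training step outside of run_clm.py; it assumes torch and transformers are installed, and everything except the model name and block size from the log above is generic.

```python
import time

import torch
from transformers import GPT2LMHeadModel

# Time one forward/backward pass on a single 512-token sequence to see whether
# per-step time is GPU-like (fractions of a second) or CPU-like (tens of seconds).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = GPT2LMHeadModel.from_pretrained("dbmdz/german-gpt2").to(device)
model.train()

input_ids = torch.randint(0, model.config.vocab_size, (1, 512), device=device)

start = time.time()
outputs = model(input_ids, labels=input_ids)  # causal LM loss, same objective as run_clm.py
outputs.loss.backward()
if device == "cuda":
    torch.cuda.synchronize()  # wait for GPU work to finish before stopping the clock
print(f"device={device}, one step took {time.time() - start:.2f} s")
```

If this step finishes in well under a second on cuda, the GPU itself is fine and the slowdown is coming from somewhere in the training setup; if it takes tens of seconds, the model is most likely not running on the GPU at all.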