Fixing this sequence of errors, where each fix is required before the next error can surface.
running:
```
cd examples/text-classification
./run_pl.sh
```
error 1:
```
Traceback (most recent call last):
File "run_pl_glue.py", line 183, in <module>
trainer = generic_train(model, args)
File "/mnt/nvme1/code/huggingface/transformers-issue-1/examples/lightning_base.py", line 289, in generic_train
if args.gpus > 1:
AttributeError: 'Namespace' object has no attribute 'gpus'
```
solution: added `--n_gpus` arg
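For reference, a minimal sketch of what this argparse change implies (the flag name and `dest` here are assumptions; whichever spelling is used, the parsed attribute must come out as `args.gpus`, since that is what `generic_train` reads):
```
import argparse

parser = argparse.ArgumentParser()
# Hypothetical sketch: generic_train() reads args.gpus, so dest must
# resolve to "gpus" regardless of how the flag itself is spelled.
parser.add_argument("--n_gpus", dest="gpus", type=int, default=0,
                    help="number of GPUs to train on (0 = CPU)")

args = parser.parse_args(["--n_gpus", "0"])
assert args.gpus == 0  # the attribute the `if args.gpus > 1` check expects
```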
error 2:
```
Traceback (most recent call last):
File "run_pl_glue.py", line 183, in <module>
trainer = generic_train(model, args)
File "/mnt/nvme1/code/huggingface/transformers-issue-1/examples/lightning_base.py", line 300, in generic_train
**train_params,
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 853, in from_argparse_args
return cls(**trainer_kwargs)
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 468, in __init__
self.tpu_cores = _parse_tpu_cores(tpu_cores)
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 526, in _parse_tpu_cores
raise MisconfigurationException("`tpu_cores` can only be 1, 8 or [<1-8>]")
pytorch_lightning.utilities.exceptions.MisconfigurationException: `tpu_cores` can only be 1, 8 or [<1-8>]
```
solution: removed `default=0` for `tpu_cores`
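A sketch of what removing the default amounts to (surrounding parser code assumed): argparse then yields `None`, which PL's `_parse_tpu_cores` accepts, so the TPU path is skipped entirely:
```
import argparse

parser = argparse.ArgumentParser()
# Before (triggers error 2): default=0 is rejected, since PL only
# accepts None, 1, 8 or a list like [<1-8>] for tpu_cores.
#   parser.add_argument("--tpu_cores", type=int, default=0)
# After: with no explicit default, argparse yields None and PL skips TPUs.
parser.add_argument("--tpu_cores", type=int)

print(parser.parse_args([]).tpu_cores)  # None
```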
error 3:
```
Traceback (most recent call last):
File "run_pl_glue.py", line 183, in <module>
trainer = generic_train(model, args)
File "/mnt/nvme1/code/huggingface/transformers-issue-1/examples/lightning_base.py", line 304, in generic_train
trainer.fit(model)
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1038, in fit
model.setup('fit')
File "/mnt/nvme1/code/huggingface/transformers-issue-1/examples/lightning_base.py", line 125, in setup
dataloader = self.get_dataloader("train", train_batch_size)
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/torch/nn/modules/module.py", line 594, in __getattr__
type(self).__name__, name))
AttributeError: 'GLUETransformer' object has no attribute 'get_dataloader'
```
solution: added a wrapper - but it's incomplete - what should be done with the `shuffle` arg? A sketch of the wrapper follows.
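A hedged sketch of such a wrapper on `GLUETransformer` (assuming the script's own loader is called `load_dataset`; adjust to whatever the script actually defines):
```
# Hypothetical wrapper for GLUETransformer in run_pl_glue.py.
# lightning_base's setup() calls self.get_dataloader("train", batch_size)
# without passing a shuffle flag, so shuffle would have to be decided here.
def get_dataloader(self, mode, batch_size, shuffle=False):
    # shuffle is accepted but currently ignored - load_dataset (assumed
    # name of the script's existing loader) doesn't take it, which is
    # exactly the unresolved part of this fix.
    return self.load_dataset(mode, batch_size)
```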
error 4:
```
Traceback (most recent call last):
File "run_pl_glue.py", line 187, in <module>
trainer = generic_train(model, args)
File "/mnt/nvme1/code/huggingface/transformers-issue-1/examples/lightning_base.py", line 306, in generic_train
trainer.fit(model)
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1044, in fit
results = self.run_pretrain_routine(model)
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1213, in run_pretrain_routine
self.train()
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 370, in train
self.run_training_epoch()
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 452, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx)
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 632, in run_training_batch
self.hiddens
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 776, in optimizer_closure
hiddens)
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 956, in training_forward
output = self.model.training_step(*args)
File "run_pl_glue.py", line 44, in training_step
tensorboard_logs = {"loss": loss, "rate": self.lr_scheduler.get_last_lr()[-1]}
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/torch/nn/modules/module.py", line 594, in __getattr__
type(self).__name__, name))
AttributeError: 'GLUETransformer' object has no attribute 'lr_scheduler'
```
solution: I'm not sure how this used to work, but there is no `self.lr_scheduler` in pytorch-lightning (PL). I found one at `self.trainer.lr_schedulers[0]["scheduler"]` and set that attribute. I have no idea whether this always works; someone who wrote this script would probably know better where the missing attribute has gone. PL sets it inside `fit()` (in the CPU path), but on the `trainer` object, not on the `nn.Module`.
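In code, the workaround amounts to something like the sketch below (hypothetical; `build_inputs` stands in for however the script assembles model inputs, and whether `trainer.lr_schedulers` is populated by the time `training_step` runs is exactly the open question):
```
def training_step(self, batch, batch_idx):
    outputs = self(**self.build_inputs(batch))  # build_inputs: hypothetical helper
    loss = outputs[0]
    # Fetch the scheduler that PL's fit() stored on the trainer, instead
    # of the self.lr_scheduler attribute that no longer exists.
    scheduler = self.trainer.lr_schedulers[0]["scheduler"]
    tensorboard_logs = {"loss": loss, "rate": scheduler.get_last_lr()[-1]}
    return {"loss": loss, "log": tensorboard_logs}
```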
Further notes:
`run_pl.sh` invokes PL in CPU mode despite an available GPU. I haven't tested this on GPU yet - during debugging I saw that PL [inits optimizers](https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/trainer.py#L1096) just before it runs `run_pretrain_routine`, so I didn't find an easy predefined PL hook where one could preset `self.lr_scheduler`.
Perhaps the PL API has changed and that's what caused this issue?
error 5:
```
Traceback (most recent call last):
File "run_pl_glue.py", line 218, in <module>
trainer = generic_train(model, args)
File "/mnt/nvme1/code/huggingface/transformers-issue-1/examples/lightning_base.py", line 305, in generic_train
trainer.fit(model)
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1044, in fit
results = self.run_pretrain_routine(model)
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1213, in run_pretrain_routine
self.train()
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 370, in train
self.run_training_epoch()
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 452, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx)
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 671, in run_training_batch
self.on_batch_end()
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/callback_hook.py", line 82, in on_batch_end
callback.on_batch_end(self, self.get_model())
File "/mnt/nvme1/code/huggingface/transformers-issue-1/examples/lightning_base.py", line 198, in on_batch_end
lrs = {f"lr_group_{i}": lr for i, lr in enumerate(self.lr_scheduler.get_lr())}
AttributeError: 'LoggingCallback' object has no attribute 'lr_scheduler'
```
solution: see notes for error 4.
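Analogously to error 4, a sketch for `LoggingCallback` (PL already passes the trainer into the hook, as the traceback's `callback.on_batch_end(self, self.get_model())` shows, so the scheduler can be looked up there):
```
# Hypothetical sketch for LoggingCallback in lightning_base.py: look the
# scheduler up on the trainer that PL passes in, rather than expecting a
# self.lr_scheduler attribute on the callback itself.
def on_batch_end(self, trainer, pl_module):
    scheduler = trainer.lr_schedulers[0]["scheduler"]
    lrs = {f"lr_group_{i}": lr for i, lr in enumerate(scheduler.get_lr())}
    pl_module.logger.log_metrics(lrs)
```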
With these fixes the code at least starts training. I didn't test further, since there is clearly a better way to do this; only the fixes for the first two errors are obviously correct to merge.
All the fixes are in one PR, since one can't get to the next error before fixing the previous ones.