Fixing this sequence of errors, where each fix is required before the next error can surface.
running:
```
cd examples/text-classification
./run_pl.sh
```
error 1:
```
Traceback (most recent call last):
File "run_pl_glue.py", line 183, in <module>
trainer = generic_train(model, args)
File "/mnt/nvme1/code/huggingface/transformers-issue-1/examples/lightning_base.py", line 289, in generic_train
if args.gpus > 1:
AttributeError: 'Namespace' object has no attribute 'gpus'
```
solution: added `--n_gpus` arg
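For reference, a minimal sketch of what this argparse change implies (the flag name and `dest` here are assumptions; whichever spelling is used, the parsed attribute must come out as `args.gpus`, since that is what `generic_train` reads):
```
import argparse

parser = argparse.ArgumentParser()
# Hypothetical sketch: generic_train() reads args.gpus, so dest must
# resolve to "gpus" regardless of how the flag itself is spelled.
parser.add_argument("--n_gpus", dest="gpus", type=int, default=0,
                    help="number of GPUs to train on (0 = CPU)")

args = parser.parse_args(["--n_gpus", "0"])
assert args.gpus == 0  # the attribute the `if args.gpus > 1` check expects
```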
error 2:
```
Traceback (most recent call last):
File "run_pl_glue.py", line 183, in <module>
trainer = generic_train(model, args)
File "/mnt/nvme1/code/huggingface/transformers-issue-1/examples/lightning_base.py", line 300, in generic_train
**train_params,
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 853, in from_argparse_args
return cls(**trainer_kwargs)
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 468, in __init__
self.tpu_cores = _parse_tpu_cores(tpu_cores)
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 526, in _parse_tpu_cores
raise MisconfigurationException("`tpu_cores` can only be 1, 8 or [<1-8>]")
pytorch_lightning.utilities.exceptions.MisconfigurationException: `tpu_cores` can only be 1, 8 or [<1-8>]
```
solution: removed `default=0` for `tpu_cores`
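A sketch of what removing the default amounts to (surrounding parser code assumed): argparse then yields `None`, which PL's `_parse_tpu_cores` accepts, so the TPU path is skipped entirely:
```
import argparse

parser = argparse.ArgumentParser()
# Before (triggers error 2): default=0 is rejected, since PL only
# accepts None, 1, 8 or a list like [<1-8>] for tpu_cores.
#   parser.add_argument("--tpu_cores", type=int, default=0)
# After: with no explicit default, argparse yields None and PL skips TPUs.
parser.add_argument("--tpu_cores", type=int)

print(parser.parse_args([]).tpu_cores)  # None
```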
error 3:
```
Traceback (most recent call last):
File "run_pl_glue.py", line 183, in <module>
trainer = generic_train(model, args)
File "/mnt/nvme1/code/huggingface/transformers-issue-1/examples/lightning_base.py", line 304, in generic_train
trainer.fit(model)
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1038, in fit
model.setup('fit')
File "/mnt/nvme1/code/huggingface/transformers-issue-1/examples/lightning_base.py", line 125, in setup
dataloader = self.get_dataloader("train", train_batch_size)
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/torch/nn/modules/module.py", line 594, in __getattr__
type(self).__name__, name))
AttributeError: 'GLUETransformer' object has no attribute 'get_dataloader'
```
solution: added a wrapper - but it's incomplete - what should be done with the `shuffle` arg? A sketch of the wrapper follows.
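A hedged sketch of such a wrapper on `GLUETransformer` (assuming the script's own loader is called `load_dataset`; adjust to whatever the script actually defines):
```
# Hypothetical wrapper for GLUETransformer in run_pl_glue.py.
# lightning_base's setup() calls self.get_dataloader("train", batch_size)
# without passing a shuffle flag, so shuffle would have to be decided here.
def get_dataloader(self, mode, batch_size, shuffle=False):
    # shuffle is accepted but currently ignored - load_dataset (assumed
    # name of the script's existing loader) doesn't take it, which is
    # exactly the unresolved part of this fix.
    return self.load_dataset(mode, batch_size)
```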
error 4:
```
Traceback (most recent call last):
File "run_pl_glue.py", line 187, in <module>
trainer = generic_train(model, args)
File "/mnt/nvme1/code/huggingface/transformers-issue-1/examples/lightning_base.py", line 306, in generic_train
trainer.fit(model)
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1044, in fit
results = self.run_pretrain_routine(model)
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1213, in run_pretrain_routine
self.train()
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 370, in train
self.run_training_epoch()
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 452, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx)
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 632, in run_training_batch
self.hiddens
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 776, in optimizer_closure
hiddens)
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 956, in training_forward
output = self.model.training_step(*args)
File "run_pl_glue.py", line 44, in training_step
tensorboard_logs = {"loss": loss, "rate": self.lr_scheduler.get_last_lr()[-1]}
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/torch/nn/modules/module.py", line 594, in __getattr__
type(self).__name__, name))
AttributeError: 'GLUETransformer' object has no attribute 'lr_scheduler'
```
solution: I'm not sure how this used to work, but there is no `self.lr_scheduler` in pytorch-lightning (PL). I found one at `self.trainer.lr_schedulers[0]["scheduler"]` and set that attribute. I have no idea whether this always works; someone who wrote this script would probably know better where the missing attribute has gone. PL sets it inside `fit()` (in the CPU path), but on the `trainer` object, not on the `nn.Module`.
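In code, the workaround amounts to something like the sketch below (hypothetical; `build_inputs` stands in for however the script assembles model inputs, and whether `trainer.lr_schedulers` is populated by the time `training_step` runs is exactly the open question):
```
def training_step(self, batch, batch_idx):
    outputs = self(**self.build_inputs(batch))  # build_inputs: hypothetical helper
    loss = outputs[0]
    # Fetch the scheduler that PL's fit() stored on the trainer, instead
    # of the self.lr_scheduler attribute that no longer exists.
    scheduler = self.trainer.lr_schedulers[0]["scheduler"]
    tensorboard_logs = {"loss": loss, "rate": scheduler.get_last_lr()[-1]}
    return {"loss": loss, "log": tensorboard_logs}
```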
Further notes:
`run_pl.sh` invokes PL in CPU mode despite an available GPU. I haven't tested this on GPU yet - during debugging I saw that PL [inits optimizers](https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/trainer.py#L1096) just before it runs `run_pretrain_routine`, so I didn't find an easy predefined PL hook where one could preset `self.lr_scheduler`.
Perhaps the PL API has changed and that's what caused this issue?
error 5:
```
Traceback (most recent call last):
File "run_pl_glue.py", line 218, in <module>
trainer = generic_train(model, args)
File "/mnt/nvme1/code/huggingface/transformers-issue-1/examples/lightning_base.py", line 305, in generic_train
trainer.fit(model)
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1044, in fit
results = self.run_pretrain_routine(model)
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1213, in run_pretrain_routine
self.train()
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 370, in train
self.run_training_epoch()
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 452, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx)
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 671, in run_training_batch
self.on_batch_end()
File "/home/stas/anaconda3/envs/main/lib/python3.7/site-packages/pytorch_lightning/trainer/callback_hook.py", line 82, in on_batch_end
callback.on_batch_end(self, self.get_model())
File "/mnt/nvme1/code/huggingface/transformers-issue-1/examples/lightning_base.py", line 198, in on_batch_end
lrs = {f"lr_group_{i}": lr for i, lr in enumerate(self.lr_scheduler.get_lr())}
AttributeError: 'LoggingCallback' object has no attribute 'lr_scheduler'
```
solution: see notes for error 4.
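Analogously to error 4, a sketch for `LoggingCallback` (PL already passes the trainer into the hook, as the traceback's `callback.on_batch_end(self, self.get_model())` shows, so the scheduler can be looked up there):
```
# Hypothetical sketch for LoggingCallback in lightning_base.py: look the
# scheduler up on the trainer that PL passes in, rather than expecting a
# self.lr_scheduler attribute on the callback itself.
def on_batch_end(self, trainer, pl_module):
    scheduler = trainer.lr_schedulers[0]["scheduler"]
    lrs = {f"lr_group_{i}": lr for i, lr in enumerate(scheduler.get_lr())}
    pl_module.logger.log_metrics(lrs)
```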
With these fixes the code at least starts training. I didn't test further, since there is clearly a better way to do this; only the fixes for the first two errors are obviously correct to merge.
All the fixes are in one PR, since one can't get to the next error before fixing the previous ones.