I'm getting an error in accelerate whenever I try to run a scikit-learn DummyClassifier. Here is the error:
wandb: Waiting for W&B process to finish... (failed 1).
wandb: Waiting for W&B process to finish... (failed 1).
wandb: Waiting for W&B process to finish... (failed 1).
wandb:
wandb: Waiting for W&B process to finish... (success).
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220801_115630-15g4tigj
wandb: Find logs at: ./wandb/offline-run-20220801_115630-15g4tigj/logs
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220801_115630-331nmxul
wandb: Find logs at: ./wandb/offline-run-20220801_115630-331nmxul/logs
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220801_115630-ydvqw4bm
wandb: Find logs at: ./wandb/offline-run-20220801_115630-ydvqw4bm/logs
wandb:
wandb: Run history:
wandb: accuracy ▁
wandb: f1 ▁
wandb: loss ▁▁▁▁▁
wandb: precision ▁
wandb: recall ▁
wandb:
wandb: Run summary:
wandb: accuracy 0.0
wandb: f1 0.0
wandb: loss 270.52997
wandb: precision 0.0
wandb: recall 0.0
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220801_115630-3kzno0ia
wandb: Find logs at: ./wandb/offline-run-20220801_115630-3kzno0ia/logs
INFO: WandB run closed
INFO: eval time = 8.820512533187866 seconds
INFO: Finished eval
----------------------------------------------------------------------------------------------------
INFO: STARTING DUMMY CLASSIFIER LOOP
Traceback (most recent call last):
  File "/home/aclifton/rf_fp/run_training.py", line 332, in <module>
    rffp_dummy_model_scores = rffp_model.fit_dummy_classifier(['most_frequent', 'uniform'],
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1185, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'fit_dummy_classifier'
EPOCH 1/1:  25%|██████████                              | 1120/4456 [00:40<02:01, 27.47it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3815255) of binary: /home/aclifton/anaconda3/envs/rffp/bin/python
Traceback (most recent call last):
  File "/home/aclifton/anaconda3/envs/rffp/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_training.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2022-08-01_11:57:22
host : silver-surfer.airlab.com
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 3815256)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2022-08-01_11:57:22
host : silver-surfer.airlab.com
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 3815257)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2022-08-01_11:57:22
host : silver-surfer.airlab.com
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 3815258)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-08-01_11:57:22
host : silver-surfer.airlab.com
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3815255)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Traceback (most recent call last):
  File "/home/aclifton/anaconda3/envs/rffp/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/accelerate/commands/launch.py", line 528, in launch_command
    multi_gpu_launcher(args)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/accelerate/commands/launch.py", line 279, in multi_gpu_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['torchrun', '--nproc_per_node', '4', 'run_training.py']' returned non-zero exit status 1.
And here is a brief outline of the code:
from sklearn.dummy import DummyClassifier
from typing import Dict, List, Union
from torch import nn
from accelerate import Accelerator

class MyModelClass(nn.Module):
    def __init__(self):
        super().__init__()

    def fit_dummy_classifier(self, strategy: Union[str, List[str]], eval_data, eval_labels) -> Dict[str, float]:
        return_dict = {}
        if isinstance(strategy, list):
            for strat in strategy:
                dummy_clf = DummyClassifier(strategy=strat)
                dummy_clf.fit(eval_data, eval_labels)
                score = dummy_clf.score(eval_data, eval_labels)
                return_dict['_'.join([strat, 'accuracy'])] = score
        else:
            dummy_clf = DummyClassifier(strategy=strategy)
            dummy_clf.fit(eval_data, eval_labels)
            score = dummy_clf.score(eval_data, eval_labels)
            return_dict['_'.join([strategy, 'accuracy'])] = score
        return return_dict
accelerator = Accelerator()
my_model = MyModelClass()

# Run a pytorch-like training loop.
# ...
# ...

# Begin dummy classifier.
accelerator.end_training()
eval_dataloader = accelerator.prepare(my_eval_dataloader)
dummy_model_scores = my_model.fit_dummy_classifier(['most_frequent', 'uniform'],
                                                   eval_dataloader.dataset['data'],
                                                   eval_dataloader.dataset['labels'])
I'm not entirely certain where I'm going wrong. The method works on CPU, but switching to accelerate seems to cause this error. Any advice is much appreciated!
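For what it's worth, here is a minimal sketch of what I suspect is happening. It uses `nn.DataParallel` as a stand-in for `DistributedDataParallel` (both wrappers store the original model as `.module`, and neither forwards custom methods), and the `MyModelClass` here is a stripped-down toy version of my class, not the real one:

```python
import torch.nn as nn

class MyModelClass(nn.Module):
    def __init__(self):
        super().__init__()

    def fit_dummy_classifier(self):
        # Stand-in for the real method, which fits a DummyClassifier.
        return {"most_frequent_accuracy": 0.0}

model = MyModelClass()
wrapped = nn.DataParallel(model)  # accelerator.prepare() similarly wraps with DistributedDataParallel

# Custom methods are not forwarded by the wrapper: nn.Module.__getattr__
# only looks up registered parameters, buffers, and submodules.
try:
    wrapped.fit_dummy_classifier()
except AttributeError as e:
    print(e)  # 'DataParallel' object has no attribute 'fit_dummy_classifier'

# The underlying model is still reachable via .module:
print(wrapped.module.fit_dummy_classifier())
```

If that is indeed the issue, I'm guessing the fix is to call the method on the unwrapped model (via `.module` or `accelerator.unwrap_model(...)`) rather than on whatever `accelerator.prepare` returned, but I'd appreciate confirmation.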