Scikit-learn DummyClassifier error when running with Accelerate

I’m getting an error in Accelerate whenever I try to run a scikit-learn DummyClassifier. Here is the error:

  File "/home/aclifton/rf_fp/run_training.py", line 332, in <module>
    rffp_dummy_model_scores = rffp_model.fit_dummy_classifier(['most_frequent', 'uniform'],
Traceback (most recent call last):
  File "/home/aclifton/rf_fp/run_training.py", line 332, in <module>
    rffp_dummy_model_scores = rffp_model.fit_dummy_classifier(['most_frequent', 'uniform'],
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1185, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
Traceback (most recent call last):
  File "/home/aclifton/rf_fp/run_training.py", line 332, in <module>
    rffp_dummy_model_scores = rffp_model.fit_dummy_classifier(['most_frequent', 'uniform'],
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1185, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'fit_dummy_classifier'
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1185, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'fit_dummy_classifier'
AttributeError: 'DistributedDataParallel' object has no attribute 'fit_dummy_classifier'
wandb: Waiting for W&B process to finish... (failed 1).
wandb: Waiting for W&B process to finish... (failed 1).d)
wandb: Waiting for W&B process to finish... (failed 1).d)
wandb:                                                                                
wandb:                                                                                
wandb:                                                                                
wandb: Waiting for W&B process to finish... (success).
wandb:                                                                                
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220801_115630-15g4tigj
wandb: Find logs at: ./wandb/offline-run-20220801_115630-15g4tigj/logs
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220801_115630-331nmxul
wandb: Find logs at: ./wandb/offline-run-20220801_115630-331nmxul/logs
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220801_115630-ydvqw4bm
wandb: Find logs at: ./wandb/offline-run-20220801_115630-ydvqw4bm/logs
wandb: 
wandb: Run history:
wandb:  accuracy ▁
wandb:        f1 ▁
wandb:      loss β–ˆβ–†β–„β–„β–…β–…β–„β–ƒβ–ƒβ–…β–„β–ƒβ–„β–‚β–ƒβ–‚β–‚β–‚β–‚β–‚β–β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–β–‚β–‚β–β–β–β–β–β–β–β–
wandb: precision ▁
wandb:    recall ▁
wandb: 
wandb: Run summary:
wandb:  accuracy 0.0
wandb:        f1 0.0
wandb:      loss 270.52997
wandb: precision 0.0
wandb:    recall 0.0
wandb: 
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220801_115630-3kzno0ia
wandb: Find logs at: ./wandb/offline-run-20220801_115630-3kzno0ia/logs
INFO: WandB run closed
INFO: eval time = 8.820512533187866 seconds
INFO: Finished eval
----------------------------------------------------------------------------------------------------
INFO: STARTING DUMMY CLASSIFIER LOOP
Traceback (most recent call last):
  File "/home/aclifton/rf_fp/run_training.py", line 332, in <module>
    rffp_dummy_model_scores = rffp_model.fit_dummy_classifier(['most_frequent', 'uniform'],
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1185, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'fit_dummy_classifier'
   EPOCH 1/1:  25%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹                                                                                                       | 1120/4456 [00:40<02:01, 27.47it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3815255) of binary: /home/aclifton/anaconda3/envs/rffp/bin/python
Traceback (most recent call last):
  File "/home/aclifton/anaconda3/envs/rffp/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
run_training.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-08-01_11:57:22
  host      : silver-surfer.airlab.com
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 3815256)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2022-08-01_11:57:22
  host      : silver-surfer.airlab.com
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 3815257)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2022-08-01_11:57:22
  host      : silver-surfer.airlab.com
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 3815258)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-08-01_11:57:22
  host      : silver-surfer.airlab.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3815255)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Traceback (most recent call last):
  File "/home/aclifton/anaconda3/envs/rffp/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/accelerate/commands/launch.py", line 528, in launch_command
    multi_gpu_launcher(args)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/accelerate/commands/launch.py", line 279, in multi_gpu_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['torchrun', '--nproc_per_node', '4', 'run_training.py']' returned non-zero exit status 1.

And here is a brief outline of the code:

from sklearn.dummy import DummyClassifier
from typing import Dict, Union, List
from torch import nn
from accelerate import Accelerator

class MyModelClass(nn.Module):
    def __init__(self):
        super().__init__()

    def fit_dummy_classifier(self, strategy: Union[str, List[str]], eval_data, eval_labels) -> Dict[str, float]:
        return_dict = {}
        if isinstance(strategy, list):
            for strat in strategy:
                dummy_clf = DummyClassifier(strategy=strat)
                dummy_clf.fit(eval_data, eval_labels)
                score = dummy_clf.score(eval_data, eval_labels)
                return_dict['_'.join([strat, 'accuracy'])] = score
        else:
            dummy_clf = DummyClassifier(strategy=strategy)
            dummy_clf.fit(eval_data, eval_labels)
            score = dummy_clf.score(eval_data, eval_labels)
            return_dict['_'.join([strategy, 'accuracy'])] = score

        return return_dict

accelerator = Accelerator()
my_model = MyModelClass()

# Run a pytorch-like training loop; my_model is passed through
# accelerator.prepare() here.
# ...

# Begin dummy classifier
accelerator.end_training()
eval_dataloader = accelerator.prepare(my_eval_dataloader)
dummy_model_scores = my_model.fit_dummy_classifier(
    ['most_frequent', 'uniform'],
    eval_dataloader.dataset['data'], eval_dataloader.dataset['labels'])

I’m not entirely certain where I’m going wrong. The method works on CPU, but switching to Accelerate seems to cause this error. Any advice is much appreciated!

When you use distributed training, your model changes (because PyTorch needs it wrapped in DistributedDataParallel), so it loses the fit_dummy_classifier method. If you do accelerator.unwrap_model(model) you’ll get back your original model, so you should do

dummy_model_scores = accelerator.unwrap_model(my_model).fit_dummy_classifier(
    ['most_frequent', 'uniform'], eval_dataloader.dataset['data'], eval_dataloader.dataset['labels']
)

at the end.
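
To see the wrapping in isolation, here is a minimal, self-contained sketch (TinyModel and its custom_method are hypothetical stand-ins for MyModelClass and fit_dummy_classifier): under a multi-GPU accelerate launch, prepare() returns a DistributedDataParallel wrapper that hides custom methods, and unwrap_model() recovers them.

from torch import nn
from accelerate import Accelerator

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)

    def forward(self, x):
        return self.linear(x)

    def custom_method(self):
        return "only defined on TinyModel"

accelerator = Accelerator()
model = accelerator.prepare(TinyModel())

# Under a multi-GPU launch, type(model) is DistributedDataParallel and
# model.custom_method() raises AttributeError; unwrapping restores the
# original module and its methods.
unwrapped = accelerator.unwrap_model(model)
print(type(model).__name__)
print(unwrapped.custom_method())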

@sgugger, Ah I understand. Thank you for that feedback, I’ll be sure to implement it here soon.

Out of curiosity, is there a way to stop distributed training and return either to a single GPU or the CPU? I could imagine a use case where one wants to do distributed training and then aggregate everything back onto a single process to do something like make plots.

You’re launching separate processes for the whole script, so the only way is to stop your script and launch a new one on a single process.
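
For lightweight single-process work inside the same script (this is not a way to leave distributed mode, just a way to restrict a block to one rank), a common pattern is to gate on the main process; this sketch uses Accelerate's is_main_process and wait_for_everyone(), both part of its public API:

from accelerate import Accelerator

accelerator = Accelerator()

# ... distributed training runs here on every process ...

# Barrier: make sure all ranks have finished training before any
# single-process work starts.
accelerator.wait_for_everyone()

if accelerator.is_main_process:
    # Runs on exactly one process; the plotting body is a placeholder.
    print("main process: aggregating results and making plots")

# Optional second barrier so no rank exits while the main process is
# still writing files.
accelerator.wait_for_everyone()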

Got it. Thank you @sgugger !