Scikit-learn DummyClassifier error when running Accelerate

I’m getting an error whenever I try to run a scikit-learn DummyClassifier under Accelerate. Here is the error (the same traceback is printed by each worker process):

  File "/home/aclifton/rf_fp/run_training.py", line 332, in <module>
    rffp_dummy_model_scores = rffp_model.fit_dummy_classifier(['most_frequent', 'uniform'],
Traceback (most recent call last):
  File "/home/aclifton/rf_fp/run_training.py", line 332, in <module>
    rffp_dummy_model_scores = rffp_model.fit_dummy_classifier(['most_frequent', 'uniform'],
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1185, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
Traceback (most recent call last):
  File "/home/aclifton/rf_fp/run_training.py", line 332, in <module>
    rffp_dummy_model_scores = rffp_model.fit_dummy_classifier(['most_frequent', 'uniform'],
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1185, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'fit_dummy_classifier'
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1185, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'fit_dummy_classifier'
AttributeError: 'DistributedDataParallel' object has no attribute 'fit_dummy_classifier'
wandb: Waiting for W&B process to finish... (failed 1).
wandb: Waiting for W&B process to finish... (failed 1).
wandb: Waiting for W&B process to finish... (failed 1).
wandb: Waiting for W&B process to finish... (success).
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220801_115630-15g4tigj
wandb: Find logs at: ./wandb/offline-run-20220801_115630-15g4tigj/logs
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220801_115630-331nmxul
wandb: Find logs at: ./wandb/offline-run-20220801_115630-331nmxul/logs
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220801_115630-ydvqw4bm
wandb: Find logs at: ./wandb/offline-run-20220801_115630-ydvqw4bm/logs
wandb: 
wandb: Run history:
wandb:  accuracy ▁
wandb:        f1 ▁
wandb:      loss █▆▄▄▅▅▄▃▃▅▄▃▄▂▃▂▂▂▂▂▁▂▂▂▂▂▂▂▁▁▂▂▁▁▁▁▁▁▁▁
wandb: precision ▁
wandb:    recall ▁
wandb: 
wandb: Run summary:
wandb:  accuracy 0.0
wandb:        f1 0.0
wandb:      loss 270.52997
wandb: precision 0.0
wandb:    recall 0.0
wandb: 
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220801_115630-3kzno0ia
wandb: Find logs at: ./wandb/offline-run-20220801_115630-3kzno0ia/logs
INFO: WandB run closed
INFO: eval time = 8.820512533187866 seconds
INFO: Finished eval
----------------------------------------------------------------------------------------------------
INFO: STARTING DUMMY CLASSIFIER LOOP
Traceback (most recent call last):
  File "/home/aclifton/rf_fp/run_training.py", line 332, in <module>
    rffp_dummy_model_scores = rffp_model.fit_dummy_classifier(['most_frequent', 'uniform'],
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1185, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'fit_dummy_classifier'
   EPOCH 1/1:  25%|██████████████████████████████████▋                                                                                                       | 1120/4456 [00:40<02:01, 27.47it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3815255) of binary: /home/aclifton/anaconda3/envs/rffp/bin/python
Traceback (most recent call last):
  File "/home/aclifton/anaconda3/envs/rffp/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
run_training.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-08-01_11:57:22
  host      : silver-surfer.airlab.com
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 3815256)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2022-08-01_11:57:22
  host      : silver-surfer.airlab.com
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 3815257)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2022-08-01_11:57:22
  host      : silver-surfer.airlab.com
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 3815258)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-08-01_11:57:22
  host      : silver-surfer.airlab.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3815255)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Traceback (most recent call last):
  File "/home/aclifton/anaconda3/envs/rffp/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/accelerate/commands/launch.py", line 528, in launch_command
    multi_gpu_launcher(args)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/accelerate/commands/launch.py", line 279, in multi_gpu_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['torchrun', '--nproc_per_node', '4', 'run_training.py']' returned non-zero exit status 1.

And here is a brief outline of the code:

from sklearn.dummy import DummyClassifier
from typing import (Dict, Union, List)
from torch import nn
from accelerate import Accelerator

class MyModelClass(nn.Module):
    def __init__(self):
        super().__init__()

    def fit_dummy_classifier(self, strategy: Union[str, List[str]], eval_data, eval_labels) -> Dict[str, float]:        
        return_dict = {}
        if isinstance(strategy, list):
            for strat in strategy:
                dummy_clf = DummyClassifier(strategy=strat)
                dummy_clf.fit(eval_data, eval_labels)
                dummy_clf.predict(eval_data)
                score = dummy_clf.score(eval_data, eval_labels)
                return_dict['_'.join([strat, 'accuracy'])] = score

        else:
            dummy_clf = DummyClassifier(strategy=strategy)
            dummy_clf.fit(eval_data, eval_labels)
            dummy_clf.predict(eval_data)
            score = dummy_clf.score(eval_data, eval_labels)
            return_dict['_'.join([strategy, 'accuracy'])] = score
        
        return return_dict

accelerator = Accelerator()
my_model = MyModelClass()

# Run a PyTorch-like training loop.
# ...
# ...

# Begin dummy classifier
accelerator.end_training()
eval_dataloader = accelerator.prepare(my_eval_dataloader)
dummy_model_scores = my_model.fit_dummy_classifier(
    ['most_frequent', 'uniform'],
    eval_dataloader.dataset['data'], eval_dataloader.dataset['labels'],
)

I’m not entirely certain where I’m going wrong. The method works on CPU, but switching to Accelerate seems to cause this error. Any advice is much appreciated!

When you use distributed training, your model changes (because PyTorch needs to wrap it in DistributedDataParallel), so it loses the fit_dummy_classifier method. If you call accelerator.unwrap_model(model) you’ll get back your original model, so you should do

dummy_model_scores = accelerator.unwrap_model(my_model).fit_dummy_classifier(
    ['most_frequent', 'uniform'], eval_dataloader.dataset['data'], eval_dataloader.dataset['labels']
)

at the end.
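
As a fuller sketch in the context of the outline above (same variable names as in the question; the is_main_process guard is an optional suggestion, not something the fix requires, so that the dummy baselines are computed once rather than on every rank):

# Begin dummy classifier
accelerator.end_training()
eval_dataloader = accelerator.prepare(my_eval_dataloader)

# Peel off the DistributedDataParallel wrapper to recover the original MyModelClass instance.
unwrapped_model = accelerator.unwrap_model(my_model)

# Optional: every rank runs this script, so guard the baseline fit to do it only once.
if accelerator.is_main_process:
    dummy_model_scores = unwrapped_model.fit_dummy_classifier(
        ['most_frequent', 'uniform'],
        eval_dataloader.dataset['data'], eval_dataloader.dataset['labels'],
    )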

@sgugger Ah, I understand. Thank you for the feedback; I’ll be sure to implement it soon.

Out of curiosity, is there a way to stop distributed training and return to either a single GPU or the CPU? I could imagine a use case where one wants to do distributed training and then aggregate everything back onto a single process to do something like make plots.

You’re launching separate processes for the whole script, so the only way is to stop your script and launch a new one on a single process.
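
For the plotting/aggregation use case specifically, one pattern you can use without leaving distributed mode (a sketch, assuming each rank ends up holding a metric tensor; the names here are just illustrations) is to gather the per-rank results and do the single-process work only on the main process:

import torch
from accelerate import Accelerator

accelerator = Accelerator()

# ... distributed training/eval leaves each rank with its own metric tensor ...
local_metric = torch.tensor([0.0], device=accelerator.device)  # hypothetical per-rank value

# Collect the per-rank metrics onto every process.
all_metrics = accelerator.gather(local_metric)

# Make sure every rank is done before one of them starts post-processing.
accelerator.wait_for_everyone()

if accelerator.is_main_process:
    # Single-process work: aggregate results, make plots, write files, etc.
    print("mean metric across ranks:", all_metrics.mean().item())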

Got it. Thank you @sgugger !