Getting error while fine tuning Deberta v3 Large

I have been trying to fine-tune the model using the instructions on the microsoft/deberta-v3-large page on Hugging Face,

but I am getting

ImportError: This example requires a source install from HuggingFace Transformers (see https://huggingface.co/transformers/installation.html#installing-from-source), but the version found is 4.11.3.

So I cloned the transformers repo on my machine, and now I am getting an error saying it can’t open run_glue.py.

What am I doing incorrectly?

Thank you.

The full error looks like this:

FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  FutureWarning,
/usr/bin/python3: can't open file ' run_glue.py': [Errno 2] No such file or directory
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 6570) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/nikhil/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/nikhil/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/nikhil/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/nikhil/.local/lib/python3.6/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/home/nikhil/.local/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/nikhil/.local/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
 run_glue.py FAILED
------------------------------------------------------------

Although it says no such file was found, the run_glue.py file is there in the correct folder.

Can you post the command you used?

Typically, if you just run python run_glue.py, then you must be in the directory of the run_glue.py script, otherwise it won’t find it.

You can of course also do: python transformers/examples/pytorch/text-classification/run_glue.py, if you run from the directory that contains the cloned transformers folder (from the root of the repo itself, the path starts at examples/).

I ran:

python -m torch.distributed.launch --nproc_per_node=${num_gpus} \
  run_glue.py \
  --model_name_or_path microsoft/deberta-v3-large \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --evaluation_strategy steps \
  --max_seq_length 256 \
  --warmup_steps 50 \
  --per_device_train_batch_size ${batch_size} \
  --learning_rate 6e-6 \
  --num_train_epochs 2 \
  --output_dir $output_dir \
  --overwrite_output_dir \
  --logging_steps 1000 \
  --logging_dir $output_dir

In the commands you are suggesting, how do I specify which model I want to train?

The --model_name_or_path argument determines which model you’d like to fine-tune. It can be the name of a model on the hub, or a path to a local folder.

However, as you’re getting a “/usr/bin/python3: can’t open file ’ run_glue.py’: [Errno 2] No such file or directory”, this means you are probably running the script from a directory outside the “examples/pytorch/text-classification” directory of Transformers.
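
A quick way to narrow this down is to check the script argument before launching. The sketch below is only an illustration: the helper name diagnose_script_arg is hypothetical, not part of Transformers or PyTorch. It also checks for stray whitespace, because the filename quoted in the log (' run_glue.py') appears to start with a space, which can happen if a line-continuation backslash gets mangled when the command is pasted.

```python
from pathlib import Path


def diagnose_script_arg(script_arg, cwd="."):
    """Return a short hint about why `python <script_arg>` might fail to open the file.

    Hypothetical helper for illustration only.
    """
    # A leading/trailing space makes Python look for a differently named file.
    if script_arg != script_arg.strip():
        return "argument has leading or trailing whitespace"
    # Relative paths are resolved against the directory the command is run from.
    if not (Path(cwd) / script_arg).is_file():
        return "file not found relative to the working directory"
    return "ok"


# The filename exactly as it is quoted in the error message above.
print(diagnose_script_arg(" run_glue.py"))
```

If this reports "file not found", either cd into examples/pytorch/text-classification first or pass the full path to run_glue.py.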


If I run just python3 run_glue.py then I get this

python3 run_glue.py
Traceback (most recent call last):
  File "run_glue.py", line 50, in <module>
    check_min_version("4.13.0.dev0")
  File "/home/nikhil/.local/lib/python3.6/site-packages/transformers/utils/__init__.py", line 35, in check_min_version
    "Check out https://huggingface.co/transformers/examples.html for the examples corresponding to other "
ImportError: This example requires a source install from HuggingFace Transformers (see `https://huggingface.co/transformers/installation.html#installing-from-source`), but the version found is 4.11.3.
Check out https://huggingface.co/transformers/examples.html for the examples corresponding to other versions of HuggingFace Transformers.

As explained by the error, you need to install Transformers from source.
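
To see why 4.11.3 is rejected, here is a rough sketch of the kind of comparison behind the message. It is not the actual check_min_version implementation from Transformers; release_tuple and is_too_old are hypothetical names, and the comparison is simplified to the numeric part of the version (suffixes such as .dev0 are ignored).

```python
import re


def release_tuple(version):
    """Return the numeric release part, e.g. '4.13.0.dev0' -> (4, 13, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def is_too_old(installed, required):
    """Simplified check: compare numeric release parts only."""
    return release_tuple(installed) < release_tuple(required)


# Versions taken from the traceback above: installed 4.11.3, required 4.13.0.dev0.
print(is_too_old("4.11.3", "4.13.0.dev0"))
```

So the copy of Transformers that Python actually imports (here the one under ~/.local/lib/python3.6/site-packages) has to be replaced by a newer one; cloning the repo alone does not change which version gets imported.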

What I usually do is run the following commands:

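# Notebook cell: remove any stale clone, then install Transformers from source
!rm -rf transformers
!git clone https://github.com/huggingface/transformers.git
# Each "!" line runs in its own shell, so "!cd transformers" would have no lasting effect; install by path instead
!pip install -q ./transformers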

Same error, no difference.

I’ve created a notebook for you: Google Colab

Thank you so much. Not getting an error on this one

After training, where is the trained model saved? I am only seeing these files and not the model.

[image: screenshot of the files in the output directory]

Everything is stored in the --output_dir you specified.
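
If it is not obvious which file is the model, a small sketch like the one below can search the output directory, including any checkpoint-<step> subfolders, for weight files. The names in WEIGHT_FILE_NAMES are an assumption: they are the names commonly written when a model is saved, but which one you get depends on the Transformers version. The helper find_weight_files is hypothetical.

```python
from pathlib import Path

# Assumed weight file names; which one appears depends on the Transformers version.
WEIGHT_FILE_NAMES = {"pytorch_model.bin", "model.safetensors"}


def find_weight_files(output_dir):
    """Return all weight files under output_dir, searching subfolders as well."""
    root = Path(output_dir)
    if not root.is_dir():
        return []
    return sorted(
        path for path in root.rglob("*")
        if path.is_file() and path.name in WEIGHT_FILE_NAMES
    )


if __name__ == "__main__":
    # Replace "output" with the value you passed as --output_dir.
    for path in find_weight_files("output"):
        print(path)
```

An empty result would suggest that training stopped before anything was saved, or that a different output directory was used.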

The image shows the contents of the output dir.