Thanks for your work! I tried DeepSpeed with Wav2Vec2 fine-tuning, and when I use the configuration file "ds_config_zero2.json" it reports the following error:
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 259, in forward
self.padding, self.dilation, self.groups)
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same
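As far as I can tell, this is the usual fp16-weights vs. fp32-inputs mismatch: DeepSpeed's fp16 mode keeps the model weights in half precision, while the data collator still feeds float32 audio. A minimal sketch that reproduces the same RuntimeError outside of DeepSpeed (the Conv1d shape is only an assumption, modeled on the first Wav2Vec2 feature-extractor layer):

import torch
import torch.nn as nn

# fp16 weights (as produced by DeepSpeed's fp16 mode) vs. float32 inputs
# (as produced by the data collator) -> same RuntimeError as above.
conv = nn.Conv1d(1, 512, kernel_size=10, stride=5).cuda().half()  # weights: torch.cuda.HalfTensor
speech = torch.randn(2, 1, 16000).cuda()                          # inputs:  torch.cuda.FloatTensor
out = conv(speech)  # RuntimeError: Input type (torch.cuda.FloatTensor) and weight type
                    # (torch.cuda.HalfTensor) should be the same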
So I made a change in the function _prepare_inputs() by adding ".half()":
def _prepare_inputs(self, inputs: Dict[str, Union[torch.Tensor, Any]]) -> Dict[str, Union[torch.Tensor, Any]]:
    """
    Prepare :obj:`inputs` before feeding them to the model, converting them to tensors if they are not already and
    handling potential state.
    """
    # Input type (torch.cuda.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same
    for k, v in inputs.items():
        if isinstance(v, torch.Tensor):
            # inputs[k] = v.to(self.args.device)
            inputs[k] = v.to(self.args.device).half()  # add .half() here
    if self.args.past_index >= 0 and self._past is not None:
        inputs["mems"] = self._past
    return inputs
I don’t know if this is the right way to change it, but then I got a new error:
File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/nn/functional.py", line 1692, in linear
output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1607370141920/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f2ed6c508b2 in /root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f2ed6ea2982 in /root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f2ed6c3bb7d in /root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x5fea0a (0x7f2f13f8da0a in /root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x5feab6 (0x7f2f13f8dab6 in /root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x1a3f6e (0x55c7aa0a8f6e in /root/anaconda3/envs/huggingface/bin/python)
frame #6: <unknown function> + 0x10e34c (0x55c7aa01334c in /root/anaconda3/envs/huggingface/bin/python)
frame #7: <unknown function> + 0x216141 (0x55c7aa11b141 in /root/anaconda3/envs/huggingface/bin/python)
frame #8: <unknown function> + 0x10e318 (0x55c7aa013318 in /root/anaconda3/envs/huggingface/bin/python)
frame #9: <unknown function> + 0x1a3f50 (0x55c7aa0a8f50 in /root/anaconda3/envs/huggingface/bin/python)
frame #10: <unknown function> + 0x10e34c (0x55c7aa01334c in /root/anaconda3/envs/huggingface/bin/python)
frame #11: <unknown function> + 0x216141 (0x55c7aa11b141 in /root/anaconda3/envs/huggingface/bin/python)
frame #12: <unknown function> + 0x10e3a8 (0x55c7aa0133a8 in /root/anaconda3/envs/huggingface/bin/python)
frame #13: <unknown function> + 0x1a3f50 (0x55c7aa0a8f50 in /root/anaconda3/envs/huggingface/bin/python)
frame #14: <unknown function> + 0x10e34c (0x55c7aa01334c in /root/anaconda3/envs/huggingface/bin/python)
frame #15: <unknown function> + 0x216141 (0x55c7aa11b141 in /root/anaconda3/envs/huggingface/bin/python)
frame #16: <unknown function> + 0x10e3a8 (0x55c7aa0133a8 in /root/anaconda3/envs/huggingface/bin/python)
frame #17: <unknown function> + 0x1a3f50 (0x55c7aa0a8f50 in /root/anaconda3/envs/huggingface/bin/python)
frame #18: <unknown function> + 0x10e34c (0x55c7aa01334c in /root/anaconda3/envs/huggingface/bin/python)
frame #19: <unknown function> + 0x216141 (0x55c7aa11b141 in /root/anaconda3/envs/huggingface/bin/python)
frame #20: <unknown function> + 0x10e3a8 (0x55c7aa0133a8 in /root/anaconda3/envs/huggingface/bin/python)
frame #21: <unknown function> + 0x1a3f50 (0x55c7aa0a8f50 in /root/anaconda3/envs/huggingface/bin/python)
frame #22: <unknown function> + 0x10e34c (0x55c7aa01334c in /root/anaconda3/envs/huggingface/bin/python)
frame #23: <unknown function> + 0x216141 (0x55c7aa11b141 in /root/anaconda3/envs/huggingface/bin/python)
frame #24: <unknown function> + 0x10e3a8 (0x55c7aa0133a8 in /root/anaconda3/envs/huggingface/bin/python)
frame #25: <unknown function> + 0x1a3f50 (0x55c7aa0a8f50 in /root/anaconda3/envs/huggingface/bin/python)
frame #26: <unknown function> + 0x10e34c (0x55c7aa01334c in /root/anaconda3/envs/huggingface/bin/python)
frame #27: <unknown function> + 0x216141 (0x55c7aa11b141 in /root/anaconda3/envs/huggingface/bin/python)
frame #28: <unknown function> + 0x10e3a8 (0x55c7aa0133a8 in /root/anaconda3/envs/huggingface/bin/python)
frame #29: <unknown function> + 0x1a3f50 (0x55c7aa0a8f50 in /root/anaconda3/envs/huggingface/bin/python)
frame #30: <unknown function> + 0x10e318 (0x55c7aa013318 in /root/anaconda3/envs/huggingface/bin/python)
frame #31: <unknown function> + 0x1a3f50 (0x55c7aa0a8f50 in /root/anaconda3/envs/huggingface/bin/python)
frame #32: <unknown function> + 0x10e3a8 (0x55c7aa0133a8 in /root/anaconda3/envs/huggingface/bin/python)
frame #33: <unknown function> + 0x1a3f50 (0x55c7aa0a8f50 in /root/anaconda3/envs/huggingface/bin/python)
frame #34: <unknown function> + 0xfd9c8 (0x55c7aa0029c8 in /root/anaconda3/envs/huggingface/bin/python)
frame #35: <unknown function> + 0x10eb77 (0x55c7aa013b77 in /root/anaconda3/envs/huggingface/bin/python)
frame #36: <unknown function> + 0x10eb8d (0x55c7aa013b8d in /root/anaconda3/envs/huggingface/bin/python)
frame #37: PyDict_SetItem + 0x502 (0x55c7aa068da2 in /root/anaconda3/envs/huggingface/bin/python)
frame #38: PyDict_SetItemString + 0x4f (0x55c7aa06986f in /root/anaconda3/envs/huggingface/bin/python)
frame #39: PyImport_Cleanup + 0xa0 (0x55c7aa0af5d0 in /root/anaconda3/envs/huggingface/bin/python)
frame #40: Py_FinalizeEx + 0x67 (0x55c7aa12a487 in /root/anaconda3/envs/huggingface/bin/python)
frame #41: <unknown function> + 0x237f03 (0x55c7aa13cf03 in /root/anaconda3/envs/huggingface/bin/python)
frame #42: _Py_UnixMain + 0x3c (0x55c7aa13d22c in /root/anaconda3/envs/huggingface/bin/python)
frame #43: __libc_start_main + 0xf5 (0x7f2f4d63d555 in /usr/lib64/libc.so.6)
frame #44: <unknown function> + 0x1dce90 (0x55c7aa0e1e90 in /root/anaconda3/envs/huggingface/bin/python)
/opt/conda/conda-bld/pytorch_1607370141920/work/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [0,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370141920/work/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [1,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370141920/work/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [2,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1607370141920/work/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [3,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
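My guess is that the blanket .half() is too aggressive: it also converts integer tensors (for example the labels), and once those are in float16 the downstream indexing can go wrong, which might be what the asserts above are showing. A more conservative variant of my change (just a sketch of the same Trainer._prepare_inputs override, casting only floating-point tensors and leaving integer tensors untouched):

def _prepare_inputs(self, inputs: Dict[str, Union[torch.Tensor, Any]]) -> Dict[str, Union[torch.Tensor, Any]]:
    """
    Prepare :obj:`inputs` before feeding them to the model, converting them to tensors if they are not already and
    handling potential state.
    """
    for k, v in inputs.items():
        if isinstance(v, torch.Tensor):
            v = v.to(self.args.device)
            # Only cast floating-point tensors to fp16; integer tensors such as the
            # labels keep their original dtype.
            if v.is_floating_point():
                v = v.half()
            inputs[k] = v
    if self.args.past_index >= 0 and self._past is not None:
        inputs["mems"] = self._past
    return inputs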
I also tried the configuration file "ds_config_zero3.json", and it gives a different error (both GPU processes print the same traceback, so I'm showing it only once):
nn.functional.linear has been overridden with a more memory efficient version. This will persist unless manually reset.
Traceback (most recent call last):
  File "run_libri960.py", line 633, in <module>
    main()
  File "run_libri960.py", line 484, in main
    vocab_size=len(processor.tokenizer),
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/transformers-4.6.0.dev0-py3.7.egg/transformers/modeling_utils.py", line 1131, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/transformers-4.6.0.dev0-py3.7.egg/transformers/models/wav2vec2/modeling_wav2vec2.py", line 976, in __init__
    self.wav2vec2 = Wav2Vec2Model(config)
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/transformers-4.6.0.dev0-py3.7.egg/transformers/models/wav2vec2/modeling_wav2vec2.py", line 782, in __init__
    self.encoder = Wav2Vec2EncoderStableLayerNorm(config)
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 197, in wrapper
    f(module, *args, **kwargs)
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/transformers-4.6.0.dev0-py3.7.egg/transformers/models/wav2vec2/modeling_wav2vec2.py", line 595, in __init__
    self.pos_conv_embed = Wav2Vec2PositionalConvEmbedding(config)
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 197, in wrapper
    f(module, *args, **kwargs)
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/transformers-4.6.0.dev0-py3.7.egg/transformers/models/wav2vec2/modeling_wav2vec2.py", line 200, in __init__
    self.conv = nn.utils.weight_norm(self.conv, name="weight", dim=2)
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/nn/utils/weight_norm.py", line 105, in weight_norm
    WeightNorm.apply(module, name, dim)
  File "/root/anaconda3/envs/huggingface/lib/python3.7/site-packages/torch/nn/utils/weight_norm.py", line 44, in apply
    module.register_parameter(name + '_g', Parameter(norm_except_dim(weight, 2, dim).data))
IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 2)
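If I read the traceback correctly, under ZeRO-3 the positional conv weight has already been partitioned by deepspeed.zero.Init by the time nn.utils.weight_norm(..., dim=2) inspects it, so it no longer looks like a 3-D Conv1d weight. A minimal sketch that reproduces the same IndexError without DeepSpeed (the flattened 1-D stand-in for the partitioned weight is purely my assumption, and the Conv1d shape is only modeled on Wav2Vec2PositionalConvEmbedding):

import torch
import torch.nn as nn

# Hypothetical stand-in for what ZeRO-3 partitioning does to the parameter:
# replace the 3-D Conv1d weight with a flattened 1-D placeholder before
# weight_norm(dim=2) gets to look at it.
conv = nn.Conv1d(768, 768, kernel_size=128, padding=64, groups=16)
conv.weight.data = conv.weight.data.flatten()
nn.utils.weight_norm(conv, name="weight", dim=2)
# IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 2)

If that is what's happening, I guess the full weight would need to be gathered first (e.g. with deepspeed.zero.GatheredParameters) before weight_norm can be applied, but I'm not sure what the intended fix is.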
Here is the command I executed in the terminal:
deepspeed --include="localhost:3,4" run_libri960.py \
    --output_dir=${output_dir} \
    --num_train_epochs="30" \
    --deepspeed=${ds_config_dir} \
    --per_device_train_batch_size="4" \
    --per_device_eval_batch_size="4" \
    --evaluation_strategy="steps" \
    --save_total_limit="3" \
    --save_steps="2000" \
    --eval_steps="500" \
    --logging_steps="50" \
    --learning_rate="3e-5" \
    --warmup_steps="500" \
    --model_name_or_path=${model_name_or_path} \
    --preprocessing_num_workers="32" \
    --group_by_length \
    --freeze_feature_extractor \
    --logging_dir=${logging_dir} \
    --gradient_accumulation_steps="2"
I’d appreciate it if you could reply to me!
@patrickvonplaten @valhalla
By the way, I also tried plain DDP (without DeepSpeed) to work around the uneven memory usage across GPUs during multi-GPU training, but with DDP I seem to hit OOM even more easily. Why is that?