stas
May 10, 2021, 4:40pm
Here is a WIP PR that makes it work with deepspeed:
huggingface:master ← stas00:ds-inputs-dtype (opened 10:29PM - 07 May 21 UTC)
Addressing the need in https://github.com/huggingface/transformers/issues/11446, this PR works on making wav2vec2 run under deepspeed.
This PR:
* changes Trainer to automatically convert floating-point inputs to the model's dtype when they aren't int64. We didn't need this for NLP models, because the embedding layer took care of it, but wav2vec2-style models take float32 inputs directly. (DeepSpeed-only at the moment; `fp16_full_eval` potentially needs the same treatment.)
* multiple fixes to the `wav2vec2` model, because it does very non-standard things. In particular, its `weight_norm` is implemented in a very odd way that defeats DeepSpeed's automatic ease-of-use handling and requires multiple manual adjustments to do the right thing: `weight_norm` creates a param, then drops it, replacing it with 2 other params, and re-creates the original on every forward (in a pre-forward hook).
* moves `require_deepspeed` to `testing_utils.py`, since multiple test files use it
* adds a `dtype` accessor to the DeepSpeed config object
* adds 8 new tests, one for each setup
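The dtype conversion from the first bullet can be sketched like this (a simplified illustration, not the actual Trainer code; `prepare_inputs` is a hypothetical helper name):

```python
import torch

def prepare_inputs(inputs, dtype):
    # cast floating-point tensors (e.g. wav2vec2's raw audio) to the
    # model's dtype; leave integer tensors such as labels/input_ids alone
    return {
        k: v.to(dtype) if torch.is_floating_point(v) else v
        for k, v in inputs.items()
    }

batch = {
    "input_values": torch.randn(2, 16000),          # float32 audio
    "labels": torch.ones(2, 20, dtype=torch.long),  # int64 targets
}
fp16_batch = prepare_inputs(batch, torch.float16)
print(fp16_batch["input_values"].dtype)  # torch.float16
print(fp16_batch["labels"].dtype)        # torch.int64
```

For NLP models this is a no-op, since their inputs are int64 ids.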
Testing with `run_asr.py`:
### ZeRO-2
Everything works:
* [x] fp16 distributed zero2
* [x] fp16 non distributed zero2
* [x] fp32 distributed zero2
* [x] fp32 non distributed zero2
Important: for distributed use you must set:
```
"zero_optimization": {
    "find_unused_parameters": true
}
```
You can use `--deepspeed examples/research_projects/wav2vec2/ds_config_wav2vec2_zero2.json`, which already includes this adjustment.
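For reference, the relevant portion of a zero2 config with that adjustment might look like this (a sketch only; the shipped `ds_config_wav2vec2_zero2.json` contains more settings):

```json
{
  "fp16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 2,
    "find_unused_parameters": true
  }
}
```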
### ZeRO-3
You can use `--deepspeed examples/research_projects/wav2vec2/ds_config_wav2vec2_zero3.json`.
This works:
* [x] fp32 non distributed zero3
* [x] fp32 distributed zero3
* [x] fp16 non distributed zero3
* [x] fp16 distributed zero3
### Possible PR spin-offs
It looks like plain PyTorch distributed doesn't work either: https://github.com/huggingface/transformers/issues/11452. This PR could be adapted to detect `dist` and do the same thing as the DeepSpeed branch, though a separate PR is probably best. (This concerns the LayerDrop behavior.)
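A sketch of what the analogous fix looks like with plain PyTorch DDP (my illustration, not the PR's code): because LayerDrop means some parameters receive no gradient on a given step, DDP needs `find_unused_parameters=True` to tolerate them.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# single-process gloo group, just to make the sketch self-contained;
# a real run would be launched with torchrun across multiple ranks
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(4, 4)
# the key flag: allow params that get no gradient on some steps
ddp_model = DDP(model, find_unused_parameters=True)
out = ddp_model(torch.randn(2, 4))
print(out.shape)  # torch.Size([2, 4])

dist.destroy_process_group()
```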
---------------------
To run tests:
Install DeepSpeed from its master branch (https://github.com/microsoft/DeepSpeed):
```
pip install git+https://github.com/microsoft/DeepSpeed
```
and then:
```
HF_DATASETS_IN_MEMORY_MAX_SIZE=0 RUN_SLOW=1 pytest examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py
```
---------------------
Example of usage, assuming you are in the top dir of a git clone of this branch:
### `run_asr.py` and tiny model and tiny dataset
This is the foundation for the new tests:
```
HF_DATASETS_IN_MEMORY_MAX_SIZE=0 PYTHONPATH=src deepspeed --num_gpus 2 \
examples/research_projects/wav2vec2/run_asr.py \
--output_dir=output_dir --num_train_epochs=2 --per_device_train_batch_size=2 \
--per_device_eval_batch_size=2 --evaluation_strategy=steps --save_steps=500 --eval_steps=100 \
--logging_steps=5 --learning_rate=5e-4 --warmup_steps=3000 \
--model_name_or_path=patrickvonplaten/wav2vec2_tiny_random_robust \
--dataset_name=patrickvonplaten/librispeech_asr_dummy --dataset_config_name=clean \
--train_split_name=validation --validation_split_name=validation --orthography=timit \
--preprocessing_num_workers=1 --group_by_length --freeze_feature_extractor --verbose_logging \
--deepspeed examples/research_projects/wav2vec2/ds_config_wav2vec2_zero2.json
```
### run_common_voice.py
This one is very hard to test with, as it takes some 5-10 min just to get ready to run.
**edit**: switch to the `datasets` master branch and add `HF_DATASETS_IN_MEMORY_MAX_SIZE=0` to the command line; the dataset will now be cached.
`run_common_voice.py` now runs under `--fp16` but gives `loss=nan`, probably the same issue as with bf16-pretrained models. I tested: it has the same issue under AMP with no deepspeed, so it's a different problem to solve.
fp32 works just fine loss-wise; you can try:
```
HF_DATASETS_IN_MEMORY_MAX_SIZE=0 PYTHONPATH="src" deepspeed --num_gpus=1 \
examples/research_projects/wav2vec2/run_common_voice.py \
--model_name_or_path="facebook/wav2vec2-large-xlsr-53" --dataset_config_name="tr" \
--output_dir=./wav2vec2-large-xlsr-turkish-demo --overwrite_output_dir --num_train_epochs="5" \
--per_device_train_batch_size="16" --learning_rate="3e-4" --warmup_steps="500" \
--evaluation_strategy="steps" --save_steps="5" --eval_steps="5" --logging_steps="5" \
--save_total_limit="3" --freeze_feature_extractor --feat_proj_dropout="0.0" --layerdrop="0.1" \
--gradient_checkpointing --group_by_length --do_train --do_eval --deepspeed \
examples/research_projects/wav2vec2/ds_config_wav2vec2_zero2.json
```
Thanks to @patrickvonplaten for making the small wav2vec2 models, which helped a ton to debug faster and were needed for the tests.
## Requirements to merge this PR
- [x] https://github.com/microsoft/DeepSpeed/pull/1135
- [x] deepspeed version requirement bumped to 0.4.0
- [x] deepspeed 0.4.0 released
Fixes: https://github.com/huggingface/transformers/issues/11446
wav2vec2 has 2 peculiarities:

1. It randomly skips layers (LayerDrop)! I think this is what requires `find_unused_parameters` in normal distributed runs and also in zero-2. For zero-3 we must run all GPUs in sync, so this problem is removed (see the PR).
2. It uses `weight_norm`, which re-creates 2 params in a pre-forward hook and has all kinds of potential side-effects. I am attempting to write a fused version of `weight_norm` + `Conv1d` which doesn't use any tricks, but I haven't fully sorted it out yet.
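For comparison, this is what stock `nn.utils.weight_norm` does to a `Conv1d` (my illustration, separate from the fused WIP below): it removes the `weight` parameter and registers `weight_g` / `weight_v`, recomputing `weight` in a pre-forward hook on every call.

```python
import torch
import torch.nn as nn

conv = nn.utils.weight_norm(nn.Conv1d(4, 4, kernel_size=3), name="weight")

# "weight" is no longer a parameter; it was split into weight_g / weight_v
names = sorted(n for n, _ in conv.named_parameters())
print(names)  # ['bias', 'weight_g', 'weight_v']

# "weight" is recomputed from weight_g and weight_v before each forward
out = conv(torch.randn(1, 4, 10))
print(out.shape)  # torch.Size([1, 4, 8])
```

It is this param shuffling that DeepSpeed ZeRO-3 cannot handle automatically.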
here is a work in progress:

```
import torch.nn as nn
from torch.nn.parameter import Parameter
from torch import _weight_norm, norm_except_dim


class Conv1dWithWeightNorm(nn.Conv1d):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.dim = 2

        # under ZeRO-3 the weight is partitioned, so gather it first
        import deepspeed
        with deepspeed.zero.GatheredParameters(self.weight):
            weight = self.weight
            self.weight_g = Parameter(norm_except_dim(weight, 2, self.dim).data)
            self.weight_v = Parameter(weight.data)
            # drop the original param and replace it with a computed tensor
            del self._parameters["weight"]
            self.weight = _weight_norm(self.weight_v, self.weight_g, self.dim)

    def compute_weight(self):
        # WIP: re-derives weight_g / weight_v on every call
        self.weight_g = Parameter(norm_except_dim(self.weight, 2, self.dim).data)
        self.weight_v = Parameter(self.weight.data)
        return _weight_norm(self.weight_v, self.weight_g, self.dim)

    def forward(self, input):
        self.weight = self.compute_weight()
        return self._conv_forward(input, self.weight, self.bias)


class Wav2Vec2PositionalConvEmbedding(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.conv = Conv1dWithWeightNorm(
            in_channels=config.hidden_size,
            out_channels=config.hidden_size,
            kernel_size=config.num_conv_pos_embeddings,
            padding=config.num_conv_pos_embeddings // 2,
            groups=config.num_conv_pos_embedding_groups,
        )
        self.padding = Wav2Vec2SamePadLayer(config.num_conv_pos_embeddings)
        self.activation = ACT2FN[config.feat_extract_activation]

    def forward(self, hidden_states):
        hidden_states = hidden_states.transpose(1, 2)
        hidden_states = self.conv(hidden_states)
        hidden_states = self.padding(hidden_states)
        hidden_states = self.activation(hidden_states)
        hidden_states = hidden_states.transpose(1, 2)
        return hidden_states
```