I’m training wav2vec2 with 🤗 Transformers on 3 A100 GPUs, but GPU utilization is not at 100% the whole time; it goes up and down with every batch.
Is this normal?
I’m using the exact same command here, only changing --nproc_per_node to 3:
python -m torch.distributed.launch --nproc_per_node 3 run_speech_recognition_ctc.py \
    --dataset_name="common_voice" --model_name_or_path="facebook/wav2vec2-large-xlsr-53" \
    --dataset_config_name="tr" --output_dir="./wav2vec2-common_voice-tr-demo-dist" \
    --overwrite_output_dir --num_train_epochs="15" --per_device_train_batch_size="4" \
    --learning_rate="3e-4" --warmup_steps="500" --evaluation_strategy="steps" \
    --text_column_name="sentence" --length_column_name="input_length" \
    --save_steps="400" --eval_steps="100" --logging_steps="1" \
    --layerdrop="0.0" --save_total_limit="3" --freeze_feature_encoder --gradient_checkpointing \
    --chars_to_ignore , ? . ! - \; \: \" “ % ‘ ” � \
    --fp16 --group_by_length --do_train --do_eval
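For context, this is roughly how I’m watching per-GPU utilization while the job runs (just polling nvidia-smi once a second; the query fields are my own choice, not part of the training script):

watch -n 1 nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv

Utilization on all 3 GPUs spikes during each batch and then drops before the next one.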