Ninja error with very large dataset using wav2vec2

When fine-tuning the wav2vec2 model (speech recognition), training halts with a “ninja” error on very large datasets.
The dataset I used is more than 60GB and consists of 100,000 wav files.
The problem also occurs with DeepSpeed, even on smaller datasets (>=30GB, 500,000 wav files).
(Those smaller datasets train fine on a single GPU without DeepSpeed.)
The machine specs are:

  • Memory 126GB
  • 2GB swap
  • 16 core CPU
  • GPU: Tesla V100 x 4
  • Ubuntu 18.04
  • transformers version: 4.6.0.dev0
  • Platform: Linux-4.15.0-140-generic-x86_64-with-debian-buster-sid
  • Python version: 3.7.9
  • PyTorch version (GPU?): 1.8.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: YES
  • Using distributed or parallel set-up in script?: Yes (but the issue happens with or without distributed set-up)

I made sure that ninja is actually installed, both via apt install and by building it from source.
It does not seem to be a genuine ninja problem, because the same script works fine with smaller datasets.
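For reference, this is roughly the check I used to confirm that ninja is reachable from the same conda environment (hf) that runs finetuning.py; it is only a diagnostic sketch, not part of the training script:

import shutil
import subprocess

from torch.utils.cpp_extension import is_ninja_available

# Where the shell finds ninja, and whether torch's own check sees it.
print("ninja on PATH:", shutil.which("ninja"))
print("torch sees ninja:", is_ninja_available())

# torch's check boils down to spawning "ninja --version" in a subprocess,
# so it can fail if the subprocess cannot be started even though the
# ninja binary itself is installed.
print(subprocess.check_output(["ninja", "--version"]).decode().strip())
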
The full error output is below.
Any help would be greatly appreciated!

❯ python finetuning.py config/experiments/laboro_train_clean_100000.json
None
loading costom configurations from /home/tomihira/wav2vec2_finetuning/config/experiments/laboro_train_clean_100000.json
[2021-06-01 11:14:43,888] [INFO] [distributed.py:37:init_distributed] Not using the DeepSpeed or torch.distributed launchers, attempting to detect MPI environment...
[2021-06-01 11:14:47,274] [INFO] [distributed.py:89:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=192.168.0.3, master_port=29500
[2021-06-01 11:14:47,274] [INFO] [distributed.py:47:init_distributed] Initializing torch distributed with backend: nccl
WARNING:__main__:Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
INFO:__main__:Training/evaluation parameters TrainingArguments(output_dir=./outputs/laboro_train_clean_100000, overwrite_output_dir=true, do_train=true, do_eval=True, do_predict=False, evaluation_strategy=IntervalStrategy.STEPS, predictio
n_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=0.0003, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max
_grad_norm=1.0, num_train_epochs=30, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_ratio=0.0, warmup_steps=500, logging_dir=runs/Jun01_11-14-43_t1, logging_strategy=IntervalStrategy.STEPS, logging_first_step=False, logging_
steps=400, save_strategy=IntervalStrategy.STEPS, save_steps=400, save_total_limit=1, no_cuda=False, seed=42, fp16=False, fp16_opt_level=O1, fp16_backend=auto, fp16_full_eval=False, local_rank=0, tpu_num_cores=None, tpu_metrics_debug=False
, debug=False, dataloader_drop_last=False, eval_steps=400, dataloader_num_workers=0, past_index=-1, run_name=./outputs/laboro_train_clean_100000, disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=Fal
se, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=[], deepspeed=./config/deepspeed.json, label_smoothing_factor=0.0, adafactor=False, group_by_length=true, length_column_name=length, report_to=['t
ensorboard', 'wandb'], ddp_find_unused_parameters=None, dataloader_pin_memory=True, skip_memory_metrics=False, use_legacy_prediction_loop=False, push_to_hub=False, _n_gpu=1, mp_parameters=)
loading configuration file https://huggingface.co/facebook/wav2vec2-large-xlsr-53/resolve/main/config.json from cache at /home/tomihira/.cache/huggingface/transformers/8508c73cd595eb416a1d517b90762416c0bc6cfbef529578079aeae4d8c14336.9c165
5e075d9ef07cda724db675f9067777f6eb2dd14269a834fcde8a48e825a
Model config Wav2Vec2Config {
  "activation_dropout": 0.1,
  "apply_spec_augment": true,
  "architectures": [
    "Wav2Vec2Model"
  ],
  "attention_dropout": 0.2,
  "bos_token_id": 1,
  "conv_bias": true,
  "conv_dim": [
    512,
    512,
    512,
    512,
    512,
    512,
    512
  ],
  "conv_kernel": [
    10,
    3,
    3,
    3,
    3,
    2,
    2
  ],
  "conv_stride": [
    5,
    2,
    2,
    2,
    2,
    2,
    2
  ],
  "ctc_loss_reduction": "mean",
  "ctc_zero_infinity": false,
  "do_stable_layer_norm": true,
  "eos_token_id": 2,
  "feat_extract_activation": "gelu",
  "feat_extract_dropout": 0.0,
  "feat_extract_norm": "layer",
  "feat_proj_dropout": 0.05,
  "final_dropout": 0.0,
  "gradient_checkpointing": "true",
  "hidden_act": "gelu",
  "hidden_dropout": 0.05,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-05,
  "layerdrop": 0.04,
  "mask_channel_length": 10,
  "mask_channel_min_space": 1,
  "mask_channel_other": 0.0,
  "mask_channel_prob": 0.0,
  "mask_channel_selection": "static",
  "mask_feature_length": 10,
  "mask_feature_prob": 0.0,
  "mask_time_length": 10,
  "mask_time_min_space": 1,
  "mask_time_other": 0.0,
  "mask_time_prob": 0.05,
  "mask_time_selection": "static",
  "model_type": "wav2vec2",
  "num_attention_heads": 16,
  "num_conv_pos_embedding_groups": 16,
  "num_conv_pos_embeddings": 128,
  "num_feat_extract_layers": 7,
  "num_hidden_layers": 24,
  "pad_token_id": 0,
  "transformers_version": "4.6.0.dev0",
  "vocab_size": 88
}

loading weights file https://huggingface.co/facebook/wav2vec2-large-xlsr-53/resolve/main/pytorch_model.bin from cache at /home/tomihira/.cache/huggingface/transformers/5d2a20b45a1689a376ec4a6282b9d9be42f931cdf8daf07c3668ba1070a059d9.db2a6
9eb44bf7b1efcfff155d4cc22155230bd8c0941701b064e9c17429a623d
All model checkpoint weights were used when initializing Wav2Vec2ForCTC.

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-xlsr-53 and are newly initialized: ['lm_head.weight', 'lm_head.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Configuration saved in ./outputs/laboro_train_clean_100000/preprocessor_config.json
tokenizer config file saved in ./outputs/laboro_train_clean_100000/tokenizer_config.json
Special tokens file saved in ./outputs/laboro_train_clean_100000/special_tokens_map.json
Configuration saved in ./outputs/laboro_train_clean_100000/preprocessor_config.json
tokenizer config file saved in ./outputs/laboro_train_clean_100000/tokenizer_config.json
Special tokens file saved in ./outputs/laboro_train_clean_100000/special_tokens_map.json
[2021-06-01 12:08:24,420] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.16+c4f4ef5, git-hash=c4f4ef5, git-branch=master
[2021-06-01 12:08:31,758] [INFO] [utils.py:13:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
[2021-06-01 12:08:32,061] [INFO] [engine.py:602:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
[2021-06-01 12:08:32,061] [INFO] [engine.py:606:_configure_optimizer] Using client Optimizer as basic optimizer
[2021-06-01 12:08:32,061] [INFO] [engine.py:616:_configure_optimizer] DeepSpeed Basic Optimizer = AdamW
[2021-06-01 12:08:32,062] [INFO] [logging.py:60:log_dist] [Rank 0] Creating fp16 unfused optimizer with dynamic loss scale
[2021-06-01 12:08:32,063] [INFO] [unfused_optimizer.py:37:__init__] Fused Lamb Legacy : False
[2021-06-01 12:08:32,367] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed Final Optimizer = AdamW
[2021-06-01 12:08:32,368] [INFO] [engine.py:444:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2021-06-01 12:08:32,368] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7f15c9029f90>
[2021-06-01 12:08:32,369] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)]
[2021-06-01 12:08:32,369] [INFO] [config.py:747:print] DeepSpeedEngine configuration:
[2021-06-01 12:08:32,381] [INFO] [config.py:751:print]   activation_checkpointing_config  {
    "partition_activations": false,
    "contiguous_memory_optimization": false,
    "cpu_checkpointing": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
}
[2021-06-01 12:08:32,381] [INFO] [config.py:751:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2021-06-01 12:08:32,381] [INFO] [config.py:751:print]   allreduce_always_fp32 ........ False
[2021-06-01 12:08:32,381] [INFO] [config.py:751:print]   amp_enabled .................. False
[2021-06-01 12:08:32,381] [INFO] [config.py:751:print]   amp_params ................... False
[2021-06-01 12:08:32,381] [INFO] [config.py:751:print]   checkpoint_tag_validation_enabled  True
[2021-06-01 12:08:32,381] [INFO] [config.py:751:print]   checkpoint_tag_validation_fail  False
[2021-06-01 12:08:32,382] [INFO] [config.py:751:print]   disable_allgather ............ False
[2021-06-01 12:08:32,382] [INFO] [config.py:751:print]   dump_state ................... False
[2021-06-01 12:08:32,382] [INFO] [config.py:751:print]   dynamic_loss_scale_args ...... {'init_scale': 4294967296, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2021-06-01 12:08:32,382] [INFO] [config.py:751:print]   elasticity_enabled ........... False
[2021-06-01 12:08:32,382] [INFO] [config.py:751:print]   flops_profiler_config ........ {
    "enabled": false,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 3,
    "detailed": true
}
[2021-06-01 12:08:32,382] [INFO] [config.py:751:print]   fp16_enabled ................. auto
[2021-06-01 12:08:32,382] [INFO] [config.py:751:print]   global_rank .................. 0
[2021-06-01 12:08:32,382] [INFO] [config.py:751:print]   gradient_accumulation_steps .. 1
[2021-06-01 12:08:32,382] [INFO] [config.py:751:print]   gradient_clipping ............ 1.0
[2021-06-01 12:08:32,382] [INFO] [config.py:751:print]   gradient_predivide_factor .... 1.0
[2021-06-01 12:08:32,382] [INFO] [config.py:751:print]   initial_dynamic_scale ........ 4294967296
[2021-06-01 12:08:32,382] [INFO] [config.py:751:print]   loss_scale ................... 0
[2021-06-01 12:08:32,382] [INFO] [config.py:751:print]   memory_breakdown ............. False
[2021-06-01 12:08:32,382] [INFO] [config.py:751:print]   optimizer_legacy_fusion ...... False
[2021-06-01 12:08:32,382] [INFO] [config.py:751:print]   optimizer_name ............... None
[2021-06-01 12:08:32,383] [INFO] [config.py:751:print]   optimizer_params ............. None
[2021-06-01 12:08:32,383] [INFO] [config.py:751:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2021-06-01 12:08:32,383] [INFO] [config.py:751:print]   pld_enabled .................. False
[2021-06-01 12:08:32,383] [INFO] [config.py:751:print]   pld_params ................... False
[2021-06-01 12:08:32,383] [INFO] [config.py:751:print]   prescale_gradients ........... False
[2021-06-01 12:08:32,383] [INFO] [config.py:751:print]   scheduler_name ............... None
[2021-06-01 12:08:32,383] [INFO] [config.py:751:print]   scheduler_params ............. None
[2021-06-01 12:08:32,383] [INFO] [config.py:751:print]   sparse_attention ............. None
[2021-06-01 12:08:32,383] [INFO] [config.py:751:print]   sparse_gradients_enabled ..... False
[2021-06-01 12:08:32,383] [INFO] [config.py:751:print]   steps_per_print .............. 100
[2021-06-01 12:08:32,383] [INFO] [config.py:751:print]   tensorboard_enabled .......... False
[2021-06-01 12:08:32,383] [INFO] [config.py:751:print]   tensorboard_job_name ......... DeepSpeedJobName
[2021-06-01 12:08:32,383] [INFO] [config.py:751:print]   tensorboard_output_path ......
[2021-06-01 12:08:32,383] [INFO] [config.py:751:print]   train_batch_size ............. 8
[2021-06-01 12:08:32,383] [INFO] [config.py:751:print]   train_micro_batch_size_per_gpu  8
[2021-06-01 12:08:32,384] [INFO] [config.py:751:print]   wall_clock_breakdown ......... false
[2021-06-01 12:08:32,384] [INFO] [config.py:751:print]   world_size ................... 1
[2021-06-01 12:08:32,384] [INFO] [config.py:751:print]   zero_allow_untested_optimizer  True
[2021-06-01 12:08:32,384] [INFO] [config.py:751:print]   zero_config .................. {
    "stage": 0,
    "contiguous_gradients": false,
    "reduce_scatter": false,
    "reduce_bucket_size": 5.000000e+08,
    "allgather_partitions": true,
    "allgather_bucket_size": 5.000000e+08,
    "overlap_comm": false,
    "load_from_fp32_weights": true,
    "elastic_checkpoint": true,
    "offload_param": null,
    "offload_optimizer": null,
    "sub_group_size": 1.000000e+12,
    "prefetch_bucket_size": 5.000000e+07,
    "param_persistence_threshold": 1.000000e+05,
    "max_live_parameters": 1.000000e+09,
    "max_reuse_distance": 1.000000e+09,
    "gather_fp16_weights_on_model_save": false,
    "find_unused_parameters": false
}
[2021-06-01 12:08:32,384] [INFO] [config.py:751:print]   zero_enabled ................. False
[2021-06-01 12:08:32,384] [INFO] [config.py:751:print]   zero_optimization_stage ...... 0
[2021-06-01 12:08:32,384] [INFO] [config.py:758:print]   json = {
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,
        "opt_level": "O3"
    },
    "steps_per_print": 100,
    "wall_clock_breakdown": "false",
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 1,
    "gradient_clipping": 1.0,
    "zero_allow_untested_optimizer": true
}
Using /home/tomihira/.cache/torch_extensions as PyTorch extensions root...
Traceback (most recent call last):
  File "finetuning.py", line 325, in <module>
    main(args)
  File "finetuning.py", line 294, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/tomihira/workspace/transformers/src/transformers/trainer.py", line 1067, in train
    self, num_training_steps=max_steps, resume_from_checkpoint=resume_from_checkpoint
  File "/home/tomihira/workspace/transformers/src/transformers/integrations.py", line 519, in deepspeed_init
    lr_scheduler=lr_scheduler,
  File "/home/tomihira/workspace/DeepSpeed/deepspeed/__init__.py", line 130, in initialize
    config_params=config_params)
  File "/home/tomihira/workspace/DeepSpeed/deepspeed/runtime/engine.py", line 198, in __init__
    util_ops = UtilsBuilder().load()
  File "/home/tomihira/workspace/DeepSpeed/deepspeed/ops/op_builder/builder.py", line 215, in load
    return self.jit_load(verbose)
  File "/home/tomihira/workspace/DeepSpeed/deepspeed/ops/op_builder/builder.py", line 252, in jit_load
    verbose=verbose)
  File "/home/tomihira/.conda/envs/hf/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1091, in load
    keep_intermediates=keep_intermediates)
  File "/home/tomihira/.conda/envs/hf/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1302, in _jit_compile
    is_standalone=is_standalone)
  File "/home/tomihira/.conda/envs/hf/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1373, in _write_ninja_file_and_build_library
    verify_ninja_availability()
  File "/home/tomihira/.conda/envs/hf/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1429, in verify_ninja_availability
    raise RuntimeError("Ninja is required to load C++ extensions")
RuntimeError: Ninja is required to load C++ extensions
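
To narrow down whether this is really a missing ninja or something that only shows up once the large dataset is in memory, my next step is to JIT-load the DeepSpeed utils op on its own, before any data is loaded. A rough sketch of what I mean (it just runs the same UtilsBuilder().load() call that raises in engine.py, in isolation):

# Load only the DeepSpeed utils C++ extension, with nothing else in memory,
# to separate "ninja is genuinely missing" from "the JIT build breaks once
# the dataset has been loaded".
from deepspeed.ops.op_builder import UtilsBuilder

util_ops = UtilsBuilder().load(verbose=True)
print("utils op loaded:", util_ops)

If that also fails, I may try pre-building the DeepSpeed ops at install time (DS_BUILD_UTILS=1 pip install .) so that nothing has to be JIT-compiled during training, but I have not verified that workaround yet.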