AttributeError: 'str' object has no attribute 'dtype' when pretraining wav2vec2

omar47 · August 1, 2022, 12:56pm

I was trying to pretrain wav2vec 2.0 using this file. However, I get the following error when I reach the training phase:

AttributeError                            Traceback (most recent call last)
<ipython-input-38-9c63e3c0d6e0> in <module>()
      5 for epoch in range(starting_epoch, num_train_epochs):
      6     model.train()
----> 7     for step, batch in enumerate(train_dataloader):
      8         # compute num of losses
      9         num_losses = batch["mask_time_indices"].sum()

5 frames
/usr/local/lib/python3.7/dist-packages/accelerate/data_loader.py in __iter__(self)
    328         # We iterate one batch ahead to check when we are at the end
    329         try:
--> 330             current_batch = next(dataloader_iter)
    331         except StopIteration:
    332             yield

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in __next__(self)
    433         if self._sampler_iter is None:
    434             self._reset()
--> 435         data = self._next_data()
    436         self._num_yielded += 1
    437         if self._dataset_kind == _DatasetKind.Iterable and \

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
    473     def _next_data(self):
    474         index = self._next_index()  # may raise StopIteration
--> 475         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    476         if self._pin_memory:
    477             data = _utils.pin_memory.pin_memory(data)

/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     45         else:
     46             data = self.dataset[possibly_batched_index]
---> 47         return self.collate_fn(data)

<ipython-input-5-e1d5eabaa1e8> in __call__(self, features)
     38             padding=self.padding,
     39             pad_to_multiple_of=self.pad_to_multiple_of,
---> 40             return_tensors="pt",
     41         )
     42 

/usr/local/lib/python3.7/dist-packages/transformers/feature_extraction_sequence_utils.py in pad(self, processed_features, padding, max_length, truncation, pad_to_multiple_of, return_attention_mask, return_tensors)
    219                 if key not in batch_outputs:
    220                     batch_outputs[key] = []
--> 221                 if value.dtype is np.dtype(np.float64):
    222                     value = value.astype(np.float32)
    223                 batch_outputs[key].append(value)

AttributeError: 'str' object has no attribute 'dtype'

I am running the code from run_wav2vec2_pretraining_no_trainer.py in a Jupyter notebook on Colab for testing purposes. When I printed the key and value at the failing line shown above, I got key = Path and value = \path\to\mp3\file\in\dataset, i.e. the path of an mp3 file in my custom dataset.
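
A quick way to confirm which columns reach the collator is to scan a few examples for string values before they are padded. The sketch below uses plain dictionaries as stand-ins for dataset rows, and `find_string_columns` is a hypothetical helper, not part of transformers:

```python
def find_string_columns(features):
    """Return the sorted names of columns that hold a str in any example."""
    offending = set()
    for example in features:
        for key, value in example.items():
            if isinstance(value, str):
                offending.add(key)
    return sorted(offending)


# Stand-in rows (assumption): real rows would come from the dataset
# that backs train_dataloader.
features = [
    {"input_values": [0.1, -0.2, 0.3], "path": "clip_0.mp3"},
    {"input_values": [0.0, 0.5], "path": "clip_1.mp3"},
]
print(find_string_columns(features))  # ['path']
```

Any column reported here is a candidate for the value that fails at `value.dtype` in the traceback.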

Does anyone know what's going on here? @patrickvonplaten

mpierrau · October 18, 2022, 1:13pm

I am facing the same issue. Did you find a solution/explanation?

mpierrau · October 19, 2022, 1:27pm

Okay, I figured it out.

In my case it originated from passing the class Wav2Vec2FeatureExtractor to DataCollatorForWav2Vec2Pretraining instead of an instance of that class.

Make sure the feature extractor is initialized before passing it to the data collator.
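
One way to catch this early is to check for a class before building the collator. This is only a sketch: `DummyFeatureExtractor` stands in for Wav2Vec2FeatureExtractor so it runs without transformers, and `require_instance` is a hypothetical helper.

```python
import inspect


class DummyFeatureExtractor:
    """Stand-in (assumption) for Wav2Vec2FeatureExtractor."""

    def pad(self, features):
        return features


def require_instance(obj, name="feature_extractor"):
    """Raise early if a class was passed where an instance is expected."""
    if inspect.isclass(obj):
        raise TypeError(
            f"{name} is the class {obj.__name__}, not an instance; "
            "create one first (for example with from_pretrained)."
        )
    return obj


require_instance(DummyFeatureExtractor())    # passes: this is an instance
try:
    require_instance(DummyFeatureExtractor)  # fails: this is the class itself
except TypeError as err:
    print(err)
```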

omar47 · October 25, 2022, 5:02pm

@mpierrau can you please share the solution?

mpierrau · October 27, 2022, 9:15am

Hi @omar47. I’m not sure we have the same original issue.

I see two alternative issues that may cause this:

Passing the class Wav2Vec2FeatureExtractor to DataCollatorForWav2Vec2Pretraining. Solution: instantiate the feature extractor before passing it to the data collator instance:

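    from transformers import Wav2Vec2Config, Wav2Vec2FeatureExtractor, Wav2Vec2ForPreTraining

    # The model constructor expects a config object, not a path
    config = Wav2Vec2Config.from_pretrained(args.model_path)
    model = Wav2Vec2ForPreTraining(config)
    # from_pretrained returns an instance, which is what the collator needs
    feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(args.model_path)

    # DataCollatorForWav2Vec2Pretraining is the class defined in the pretraining script
    data_collator = DataCollatorForWav2Vec2Pretraining(
        model=model,
        feature_extractor=feature_extractor,
    )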

The dataset still contains its original columns:
This problem appeared again for me because I did not remove the unused columns in the preprocessing step (prepare_dataset). As a result, each example still carries the raw columns (such as the file path string) alongside the model inputs, which I think the padding function in the data collator does not expect. Make sure that you keep the line remove_columns=raw_datasets["train"].column_names when mapping the prepare_dataset function over your dataset:

    vectorized_datasets = raw_datasets.map(
        prepare_dataset,
        num_proc=args.preprocessing_num_workers,
        remove_columns=raw_datasets["train"].column_names,
        cache_file_names=cache_file_names,
    )

These are the two things that I could identify that alleviated the issue in my case. Good luck!
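
To illustrate the second point, here is a rough single-row imitation of what remove_columns changes. It is not the datasets implementation: `map_example` and `prepare_example` are hypothetical stand-ins, and the sketch assumes the mapped function returns only the new columns.

```python
def map_example(example, fn, remove_columns=()):
    """Imitate, for one row, merging the function's output into the row."""
    # Drop the requested columns first, then add what the function returns.
    kept = {k: v for k, v in example.items() if k not in remove_columns}
    kept.update(fn(example))
    return kept


def prepare_example(example):
    # Hypothetical stand-in for prepare_dataset: returns only model inputs.
    return {"input_values": [0.1, -0.2, 0.3], "input_length": 3}


raw = {"path": "clip_0.mp3", "sentence": "hello"}

without_removal = map_example(raw, prepare_example)
with_removal = map_example(raw, prepare_example, remove_columns=list(raw))

print(sorted(without_removal))  # ['input_length', 'input_values', 'path', 'sentence']
print(sorted(with_removal))     # ['input_length', 'input_values']
```

Without the removal, the string-valued path column is still in every row that the collator tries to pad.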

omar47 · October 31, 2022, 3:49pm

Okay, that did resolve that error. Now I am seeing the following error:

Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/launch.py", line 910, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/launch.py", line 397, in simple_launcher
    process = subprocess.Popen(cmd, env=current_env)
  File "/usr/lib/python3.7/subprocess.py", line 800, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.7/subprocess.py", line 1462, in _execute_child
    env_list.append(k + b'=' + os.fsencode(v))
  File "/usr/lib/python3.7/os.py", line 812, in fsencode
    filename = fspath(filename)  # Does type-checking of `filename`.
TypeError: expected str, bytes or os.PathLike object, not NoneType

omar47 · October 31, 2022, 5:46pm

I tried running the code by copy-pasting the scripts into a Colab notebook. I was unable to access the values in train_dataloader. It gives the same error when I try to convert train_dataloader to a list, so maybe this didn’t resolve the error. I had edited the code to load the model and feature_extractor with from_pretrained.

mpierrau · November 1, 2022, 1:31pm

Hey @omar47,

Glad to see my previous reply was of help.

I haven’t seen the error you posted above before, but a quick Google search turns up two similar issues where a recent accelerate update on Google Colab may be the cause: this one and this one.

For the first one, the solution was to downgrade accelerate to version 0.12.0 (pip install accelerate==0.12.0). If you try this, make sure to create a new virtual environment before downgrading, as I don’t know whether other Hugging Face packages depend on accelerate > 0.12.0.
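
Before and after downgrading, it may help to confirm which version the environment actually picks up. A minimal sketch, assuming Python 3.8+ for importlib.metadata (newer than the 3.7 in the traceback above); `installed_version` is a hypothetical helper:

```python
from importlib import metadata  # standard library from Python 3.8


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("accelerate:", installed_version("accelerate"))
```

In a notebook, the runtime usually has to be restarted after the pip install for the new version to be imported.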
