I was trying to pretrain wav2vec 2.0 using this file. However, I get the following error when I reach the training phase:
AttributeError Traceback (most recent call last)
<ipython-input-38-9c63e3c0d6e0> in <module>()
5 for epoch in range(starting_epoch, num_train_epochs):
6 model.train()
----> 7 for step, batch in enumerate(train_dataloader):
8 # compute num of losses
9 num_losses = batch["mask_time_indices"].sum()
/usr/local/lib/python3.7/dist-packages/accelerate/data_loader.py in __iter__(self)
328 # We iterate one batch ahead to check when we are at the end
329 try:
--> 330 current_batch = next(dataloader_iter)
331 except StopIteration:
332 yield
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in __next__(self)
433 if self._sampler_iter is None:
434 self._reset()
--> 435 data = self._next_data()
436 self._num_yielded += 1
437 if self._dataset_kind == _DatasetKind.Iterable and \
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
473 def _next_data(self):
474 index = self._next_index() # may raise StopIteration
--> 475 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
476 if self._pin_memory:
477 data = _utils.pin_memory.pin_memory(data)
/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
45 else:
46 data = self.dataset[possibly_batched_index]
---> 47 return self.collate_fn(data)
<ipython-input-5-e1d5eabaa1e8> in __call__(self, features)
38 padding=self.padding,
39 pad_to_multiple_of=self.pad_to_multiple_of,
---> 40 return_tensors="pt",
41 )
42
/usr/local/lib/python3.7/dist-packages/transformers/feature_extraction_sequence_utils.py in pad(self, processed_features, padding, max_length, truncation, pad_to_multiple_of, return_attention_mask, return_tensors)
219 if key not in batch_outputs:
220 batch_outputs[key] = []
--> 221 if value.dtype is np.dtype(np.float64):
222 value = value.astype(np.float32)
223 batch_outputs[key].append(value)
AttributeError: 'str' object has no attribute 'dtype'
I am running the code from run_wav2vec2_pretraining_no_trainer.py in a Jupyter notebook on Colab for testing purposes. When I printed the key and value mentioned in the code above, I got key = Path and value = \path\to\mp3\file\in\dataset for an mp3 file in my custom dataset.
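The traceback makes sense with that observation: the pad() method iterates over every column in the batch and checks value.dtype to decide whether to downcast float64 audio to float32, so a leftover string column (like the path above) has no dtype attribute and raises. A minimal sketch of why the check fails on a string:

```python
import numpy as np

# A numpy audio array has a dtype, so pad()'s float64 check works on it.
audio = np.zeros(4, dtype=np.float64)
print(audio.dtype is np.dtype(np.float64))  # True

# A raw path string has no .dtype, so `value.dtype` raises AttributeError,
# exactly as in the traceback above.
path = "/path/to/file.mp3"
print(hasattr(path, "dtype"))  # False
```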
Hi @omar47. I’m not sure we have the same original issue.
I see two alternative issues that may cause this:
Passing the class Wav2Vec2FeatureExtractor itself to DataCollatorForWav2Vec2Pretraining. Solution: instantiate the feature extractor before passing it to the data collator:
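A minimal sketch of the fix; DataCollatorForWav2Vec2Pretraining is defined in run_wav2vec2_pretraining_no_trainer.py (not importable from transformers), and model is assumed to exist, so those lines are shown as comments:

```python
from transformers import Wav2Vec2FeatureExtractor

# Wrong: this hands the *class* to the collator, not a configured extractor.
# data_collator = DataCollatorForWav2Vec2Pretraining(
#     model=model, feature_extractor=Wav2Vec2FeatureExtractor)

# Right: create an instance first. The script loads it with
# Wav2Vec2FeatureExtractor.from_pretrained(args.model_name_or_path);
# the default constructor is used here only to keep the sketch offline.
feature_extractor = Wav2Vec2FeatureExtractor()
# data_collator = DataCollatorForWav2Vec2Pretraining(
#     model=model, feature_extractor=feature_extractor)
```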
The dataset is still a dictionary:
This problem reappeared for me because I had not removed the unused columns in the preprocessing step (using prepare_dataset). Because of this, the data is still in dictionary form, which I believe the padding function in the data collator does not expect. Make sure you keep the line remove_columns=raw_datasets["train"].column_names when mapping the prepare_dataset function over your dataset:
Okay, that did resolve that error. Now I am seeing the following one:
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/accelerate_cli.py", line 43, in main
args.func(args)
File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/launch.py", line 910, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/launch.py", line 397, in simple_launcher
process = subprocess.Popen(cmd, env=current_env)
File "/usr/lib/python3.7/subprocess.py", line 800, in __init__
restore_signals, start_new_session)
File "/usr/lib/python3.7/subprocess.py", line 1462, in _execute_child
env_list.append(k + b'=' + os.fsencode(v))
File "/usr/lib/python3.7/os.py", line 812, in fsencode
filename = fspath(filename) # Does type-checking of `filename`.
TypeError: expected str, bytes or os.PathLike object, not NoneType
I tried running the code by copy-pasting the scripts into a Colab notebook. I was unable to access the values in train_dataloader; it gives the same error when I try to convert train_dataloader to a list, so maybe this didn't resolve the error. I had edited the code to use a model and feature_extractor loaded from pretrained.
I haven’t seen the error you posted above before, but a quick Google search turns up two similar issues where Google Colab combined with a recent accelerate update may be the cause: this one and this one.
For the first one, the solution was to downgrade accelerate to version 0.12.0 (pip install accelerate==0.12.0). If you try this, make sure to create a new virtual environment before downgrading, as I am not sure whether other Hugging Face packages depend on accelerate > 0.12.0.
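If you go that route, the downgrade in a fresh environment could look something like this (accelerate-012-env is just a hypothetical environment name):

```shell
# Create a fresh virtual environment so the downgrade cannot interfere
# with other Hugging Face packages installed system-wide.
python3 -m venv accelerate-012-env
source accelerate-012-env/bin/activate

# Pin accelerate to the version reported to work around the Colab issue.
pip install accelerate==0.12.0

# Verify which version is now active.
python -c "import accelerate; print(accelerate.__version__)"
```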