How to pretrain XLSR wav2vec2 on my unlabeled speech data

I want to update the XLSR wav2vec2 weights using unlabeled training data (.wav audio) from my domain. In other words, I want to continue pretraining the model so that it is exposed to my data before I start fine-tuning it on labeled data.

This is the model from Hugging Face:

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    attention_dropout=0.1,
    hidden_dropout=0.1,
    feat_proj_dropout=0.0,
    mask_time_prob=0.05,
    layerdrop=0.1,
    gradient_checkpointing=True,
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
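(For context on `mask_time_prob` above: my understanding is that it is roughly the probability of each feature frame being picked as the start of a masked span, with `mask_length` consecutive frames then masked and spans allowed to overlap. A toy pure-Python sketch of that idea, not the library's actual implementation:)

```python
import random

def span_mask(seq_len, mask_prob=0.05, mask_length=10, seed=0):
    # Each frame is picked as a span start with probability mask_prob;
    # mask_length consecutive frames from each start are masked
    # (spans may overlap, so the masked fraction exceeds mask_prob).
    rng = random.Random(seed)
    mask = [False] * seq_len
    for start in range(seq_len):
        if rng.random() < mask_prob:
            for i in range(start, min(start + mask_length, seq_len)):
                mask[i] = True
    return mask

mask = span_mask(500)
print(f"{sum(mask)} / {len(mask)} frames masked")
```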

Please let me know how I can use some code to expose XLSR to my unlabeled data as well.
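From what I could find, continued pretraining seems to need `Wav2Vec2ForPreTraining` rather than `Wav2Vec2ForCTC`, since the pretraining model computes the contrastive loss itself (there is also an official script, `examples/pytorch/speech-pretraining/run_wav2vec2_pretraining_no_trainer.py`, in the transformers repo). Is something like this single-batch sketch the right direction? Random audio is a stand-in for a real .wav, and the mask parameters are my guesses:

```python
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForPreTraining
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    _compute_mask_indices,
    _sample_negative_indices,
)

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-large-xlsr-53")
model = Wav2Vec2ForPreTraining.from_pretrained("facebook/wav2vec2-large-xlsr-53")

# One second of random audio as a stand-in for a real 16 kHz .wav file
raw_audio = np.random.randn(16000).astype(np.float32)
input_values = feature_extractor(
    raw_audio, sampling_rate=16000, return_tensors="pt"
).input_values

batch_size, raw_sequence_length = input_values.shape
sequence_length = int(model._get_feat_extract_output_lengths(raw_sequence_length))

# Choose which time steps to mask, and sample negatives for the contrastive loss
mask_time_indices = _compute_mask_indices(
    shape=(batch_size, sequence_length), mask_prob=0.065, mask_length=10, min_masks=2
)
sampled_negative_indices = _sample_negative_indices(
    features_shape=(batch_size, sequence_length),
    num_negatives=model.config.num_negatives,
    mask_time_indices=mask_time_indices,
)
mask_time_indices = torch.tensor(mask_time_indices, dtype=torch.bool)
sampled_negative_indices = torch.tensor(sampled_negative_indices, dtype=torch.long)

outputs = model(
    input_values,
    mask_time_indices=mask_time_indices,
    sampled_negative_indices=sampled_negative_indices,
)
loss = outputs.loss  # contrastive + diversity loss to backprop in a training loop
```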
Also, when I try to train the XLSR model on my unlabeled data without any validation data or evaluation metric, it fails with:
KeyError: 'loss'
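If it helps, here is a toy stand-in for what I think is happening (an illustrative dict, not the real Trainer internals): without `labels`, the CTC model's output contains logits but no "loss" entry, so the lookup fails.

```python
# Stand-in for a Wav2Vec2ForCTC forward pass called without `labels`:
# the output behaves like a dict that has logits but no "loss" entry.
outputs = {"logits": [[0.1, 0.9], [0.8, 0.2]]}

try:
    loss = outputs["loss"]  # roughly what Trainer does to get the loss
except KeyError as err:
    print(f"KeyError: {err}")
```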

Here is my code:


from transformers import Trainer
from transformers import TrainingArguments

training_args = TrainingArguments(
  output_dir="/content/drive/MyDrive/wav2vec2-large-xlsr",
  group_by_length=True,
  per_device_train_batch_size=16,
  gradient_accumulation_steps=2,
  evaluation_strategy="steps",
  num_train_epochs=30,
  fp16=True,
  save_steps=200,
  eval_steps=200,
  logging_steps=200,
  learning_rate=3e-4,
  warmup_steps=300,
  save_total_limit=3,
  do_train=True,
)
trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    train_dataset=data,
    tokenizer=processor.feature_extractor,
)
trainer.train()

Thanks in advance


@patrickvonplaten I hope you can help me with this.