I want to update the XLSR wav2vec2 weights using unlabeled training data (.wav audio files) from my domain. In other words, I want to continue pretraining the model so that it is exposed to my data before I fine-tune it on labeled data.
This is the model from Hugging Face:
from transformers import Wav2Vec2ForCTC

# `processor` is the Wav2Vec2Processor created earlier in my notebook
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    attention_dropout=0.1,
    hidden_dropout=0.1,
    feat_proj_dropout=0.0,
    mask_time_prob=0.05,
    layerdrop=0.1,
    gradient_checkpointing=True,
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
Please let me know how I can expose XLSR to my unlabeled data, ideally with some example code.
Also, when I try to train the XLSR model on my unlabeled data without any validation data or evaluation metric, I get this error:
KeyError: 'loss'
Here is my code:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/wav2vec2-large-xlsr",
    group_by_length=True,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    num_train_epochs=30,
    fp16=True,
    save_steps=200,
    eval_steps=200,
    logging_steps=200,
    learning_rate=3e-4,
    warmup_steps=300,
    save_total_limit=3,
    do_train=True,
)

trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    train_dataset=data,  # unlabeled audio only, no labels
    tokenizer=processor.feature_extractor,
)

trainer.train()
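For context, here is a minimal sketch of what seems to trigger the error. As far as I can tell, `Wav2Vec2ForCTC` only computes a loss when `labels` are passed, so with unlabeled audio the model output has no `loss` entry for the `Trainer` to read. The tiny random config and the name `tiny_model` are hypothetical stand-ins for the XLSR checkpoint (so nothing has to be downloaded), and the random tensor is a stand-in for one of my .wav files.

```python
import torch
from transformers import Wav2Vec2Config, Wav2Vec2ForCTC

# Hypothetical tiny config (NOT the XLSR checkpoint), randomly initialized.
config = Wav2Vec2Config(
    vocab_size=32,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    num_feat_extract_layers=2,
    conv_dim=(32, 32),
    conv_stride=(5, 2),
    conv_kernel=(10, 3),
    num_conv_pos_embeddings=16,
    num_conv_pos_embedding_groups=2,
)
tiny_model = Wav2Vec2ForCTC(config).eval()

# One second of fake 16 kHz audio standing in for an unlabeled .wav file.
input_values = torch.randn(1, 16000)

with torch.no_grad():
    # Forward pass without labels, as happens with my unlabeled dataset.
    unlabeled_out = tiny_model(input_values=input_values)
    # Forward pass with (made-up) CTC label ids for comparison.
    labeled_out = tiny_model(
        input_values=input_values, labels=torch.tensor([[1, 2, 3]])
    )

# Check whether each output exposes a "loss" key.
print("loss" in unlabeled_out)
print("loss" in labeled_out)
```

If this reading is right, the CTC head is simply the wrong objective for unlabeled audio, and what I am really asking is how to run the self-supervised pretraining objective on the XLSR checkpoint instead.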
Thanks in advance