How to train on your own corpus without labels

Hi, I’d like to fine-tune BERT on my product catalog corpus, which contains a lot of out-of-vocabulary words like brand names. By fine-tuning, I mean transfer learning and not training from scratch.

I have been following this [Fine-tuning a pretrained model — transformers 4.5.0.dev0 documentation] tutorial and see that it requires labels. As you can imagine, my use case is connected to information retrieval and search and does not contain any y_labels. All I want are vector embeddings from my trained model.

How should I approach this problem using the HF Trainer module?

hey @awaiskaleem if i understand correctly, i think what you’re looking for is to fine-tune the language model on your corpus. this will generally produce mask-filling that more accurately captures the relations in your corpus, and for BERT you can check out the Masked language modeling section of this tutorial: Google Colaboratory
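
here’s a rough sketch of what that masked language modeling step could look like with the Trainer. it assumes your catalog text is in a plain-text file (one entry per line); the file path, model name, and hyperparameters are just placeholders to adapt to your setup:

```python
# Sketch: MLM fine-tuning on an unlabeled corpus with the Trainer.
# No y labels needed — the data collator creates masked-token targets on the fly.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# "catalog.txt" is a placeholder: one product description per line.
dataset = load_dataset("text", data_files={"train": "catalog.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Randomly masks 15% of tokens and uses them as prediction targets.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="bert-catalog-mlm",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    data_collator=data_collator,
)
trainer.train()

# Save both model and tokenizer so the checkpoint can be reloaded later.
trainer.save_model("bert-catalog-mlm")
tokenizer.save_pretrained("bert-catalog-mlm")
```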

once the language model is fine-tuned, you can save the weights and then load them with AutoModel.from_pretrained to generate your embeddings.
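
and a sketch of the embedding step, assuming the checkpoint directory from the sketch above. mean pooling over the last hidden state is just one common pooling choice (using the [CLS] vector is another):

```python
# Sketch: load the fine-tuned weights and compute sentence embeddings.
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "bert-catalog-mlm"  # placeholder path from the training sketch
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)
model.eval()

texts = ["acme ultra-grip running shoes", "globex 4k streaming stick"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mask-aware mean pooling: average token vectors, ignoring padding positions.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (2, 768) for a bert-base model
```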


Thanks! That’s what we were looking for.
