Fine-tuning a BERT model on a custom dataset

I am trying to fine-tune the BERT model on a custom dataset for an NER task with PyTorch, but I could not get a dataset in exactly the format accepted by the Transformers example training scripts.
The CoNLL-2003 dataset comes in .txt files, which the training scripts do not accept, and none of my CSV files work either. Is there a dataset I can download and use locally, without referring to the Hub datasets?
Why do datasets not work when we use them locally?


The scripts are meant as examples; you can easily tweak them to make them work with your local dataset.

For instance, you can load a HuggingFace Dataset from your local data (as explained in the docs), which can be CSV, JSON, txt, Parquet, etc.

Just make sure that you prepare your data in IOB format (tags like `B-PER`, `I-PER`, `O`), as this is what the token classification models require.
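To make the IOB scheme concrete, here is a small sketch that turns token-level entity spans into IOB tags; the helper function and its input format are hypothetical, just for illustration:

```python
# Sketch: IOB tagging of a tokenized sentence. The first token of an entity
# gets "B-<type>", continuation tokens get "I-<type>", everything else "O".
def to_iob(tokens, entities):
    """entities: list of (start, end_exclusive, type) over token indices."""
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags

tokens = ["Angela", "Merkel", "visited", "Paris"]
print(to_iob(tokens, [(0, 2, "PER"), (3, 4, "LOC")]))
# → ['B-PER', 'I-PER', 'O', 'B-LOC']
```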

If you can share a small portion of your data, I can illustrate how to make a HuggingFace Dataset with it and get it working with the script.
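In the meantime, since the CoNLL-2003 .txt files are not accepted directly, one option is to convert them to JSON lines that `load_dataset("json", ...)` can read. A sketch, assuming the usual CoNLL layout (one token per line with the token in the first column and the NER tag in the last, blank lines between sentences):

```python
# Sketch: converting a CoNLL-2003-style .txt file to JSON lines.
# Assumes: token is the first whitespace-separated field, NER tag the last,
# sentences are separated by blank lines, "-DOCSTART-" lines are headers.
import json


def read_conll(path):
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("-DOCSTART-"):
                if tokens:
                    sentences.append({"tokens": tokens, "ner_tags": tags})
                    tokens, tags = [], []
                continue
            fields = line.split()
            tokens.append(fields[0])
            tags.append(fields[-1])
    if tokens:  # flush the last sentence if the file has no trailing blank line
        sentences.append({"tokens": tokens, "ner_tags": tags})
    return sentences


def write_jsonl(sentences, path):
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            f.write(json.dumps(sentence) + "\n")
```

After converting train.txt and validate.txt this way, you can point `load_dataset("json", data_files={...})` at the resulting files.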


Here is the example dataset I have prepared for my NER task.

I also tried the CoNLL-2003 dataset locally (which is exactly what I need: using a dataset locally without referring to the HF Hub). Here is the link to the data that I downloaded and tried. My reasoning was that even if something is wrong with my custom dataset, the scripts should at least work with CoNLL-2003. But I got the error below when I started with the CoNLL-2003 train.txt and validate.txt from there.

I would be glad if you could help me get the training done on any dataset, but locally, without going to the Hub. Thanks!

This is the error I get while using the dataset locally.