How to deal with two heterogeneous training datasets

joadorn · November 11, 2022, 10:52am

I use a “token classification” approach (fine-tuning BertForTokenClassification) to parse bibliographic references written in natural language, i.e. to extract the individual fields so they can be submitted as metadata to publication repositories. I have two training datasets, both in IOB format:

The first one is currently being created manually by annotating references on Doccano. It is very accurate but will not contain more than a few thousand samples at most.
The second one was generated automatically by destructuring references that have already been deposited. The reference strings are rendered with citeproc-py, using CSL styles chosen at random from those of many journals. This set is very large (1 million references), but it is less accurate (there are human errors in the source data) and the formats do not correspond exactly to what the model will have to deal with in reality.
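To make the IOB format concrete, here is a minimal sketch of how a destructured reference could be turned into token/tag pairs. The function name `to_iob`, the label names (AUTHOR, YEAR, TITLE, JOURNAL) and the whitespace tokenization are assumptions for illustration only, not the actual pipeline.

```python
def to_iob(fields):
    """Convert (label, text) pairs, given in reading order, to IOB tags.

    A label of None marks text that belongs to no field (tag "O").
    Tokenization is plain whitespace splitting (an assumption).
    """
    tokens, tags = [], []
    for label, text in fields:
        for i, tok in enumerate(text.split()):
            tokens.append(tok)
            if label is None:
                tags.append("O")
            else:
                # First token of a field gets B-, the following ones I-
                tags.append(("B-" if i == 0 else "I-") + label)
    return tokens, tags


# Hypothetical reference, already split into its fields
sample = [
    ("AUTHOR", "Doe, J."),
    ("YEAR", "(2020)."),
    ("TITLE", "Parsing references"),
    (None, "In:"),
    ("JOURNAL", "J. Doc."),
]
tokens, tags = to_iob(sample)
for tok, tag in zip(tokens, tags):
    print(f"{tok}\t{tag}")
```

Both datasets end up in this shape; the difference is only in how the field boundaries were obtained (manual annotation versus rendering from metadata).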

With the second dataset alone I already reach an accuracy higher than 97% in the validation phase. But I wonder whether it would be acceptable to add a second fine-tuning step on the small dataset, so that the model learns more atypical sentence structures, or whether I should instead find a way to mix the two sets.
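For the mixing option, one possibility is to oversample the small manual set so that it makes up a fixed share of the training data. The sketch below assumes both sets are plain Python lists of samples; `mix_datasets`, `small_fraction` and `size` are hypothetical names, and the 20% share is an arbitrary starting point rather than a recommendation.

```python
import random


def mix_datasets(large, small, small_fraction=0.2, size=None, seed=0):
    """Return a shuffled list in which about `small_fraction` of the
    samples come from `small` (drawn with replacement, i.e. oversampled)
    and the rest from `large` (drawn without replacement)."""
    rng = random.Random(seed)
    if size is None:
        size = len(large)
    n_small = round(size * small_fraction)
    n_large = min(size - n_small, len(large))
    mixed = rng.sample(large, n_large) + rng.choices(small, k=n_small)
    rng.shuffle(mixed)
    return mixed


# Toy stand-ins for the two datasets
auto_set = [("auto", i) for i in range(1000)]
manual_set = [("manual", i) for i in range(10)]

mixed = mix_datasets(auto_set, manual_set, small_fraction=0.2, size=500)
n_manual = sum(1 for source, _ in mixed if source == "manual")
print(len(mixed), n_manual)
```

The two-step alternative would instead train on the large set first and then continue training on the manual set only, usually with a lower learning rate; in either case the validation set should be taken from the manual data, since that is closest to what the model will see in reality.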
