I use a token classification approach (fine-tuning BertForTokenClassification) to parse bibliographic references written in natural language, i.e. to extract the individual fields so they can be submitted as metadata to publication repositories. I have two training datasets, both in IOB format:
- The first one is being created manually by annotating references in Doccano. It is very accurate but will contain a few thousand samples at most.
- The second one was generated automatically by destructuring references that had already been deposited. The reference strings are rendered with citeproc-py, using CSL styles chosen at random from those of many journals. This set is very large (1 million references) but less accurate (the source data contains human errors), and its formats do not exactly match what the model will have to deal with in practice.
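For readers unfamiliar with IOB, here is a minimal sketch of how one reference ends up labelled in both datasets. The field names (AUTHOR, YEAR, TITLE), the tokenization, and the helper `spans_to_iob` are illustrative assumptions, not my actual label set or pipeline.

```python
def spans_to_iob(tokens, spans):
    """Turn field spans into IOB tags.

    spans: list of (start, end, label) over token indices, end exclusive.
    Tokens outside every span get the tag "O".
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the field
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the field
    return tags


# Hypothetical reference, already split into tokens.
tokens = ["Doe", ",", "J.", "(", "2020", ")", "Parsing", "references", "."]
spans = [(0, 3, "AUTHOR"), (4, 5, "YEAR"), (6, 8, "TITLE")]

tags = spans_to_iob(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Punctuation that belongs to no field (the parentheses and the final period here) stays tagged `O`.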
With the second dataset alone I already reach an accuracy higher than 97% during validation. Would it be acceptable to add a second fine-tuning step on the small dataset, so that the model learns more atypical sentence structures, or should I instead try to find a way to mix the two datasets?
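To make the mixing option concrete, here is a rough sketch of what I have in mind: each training example is drawn from the manual set with a fixed probability, so the small set is oversampled (with replacement) relative to its size. The function name `mix_datasets`, the 0.3 fraction, and the toy data are all assumptions for illustration. With PyTorch, something similar could presumably be built with `torch.utils.data.ConcatDataset` and `WeightedRandomSampler`.

```python
import random


def mix_datasets(manual, generated, manual_fraction, n_samples, seed=0):
    """Draw n_samples examples, each from `manual` with probability
    manual_fraction and from `generated` otherwise (with replacement)."""
    rng = random.Random(seed)
    mixed = []
    for _ in range(n_samples):
        source = manual if rng.random() < manual_fraction else generated
        mixed.append(rng.choice(source))
    return mixed


# Toy stand-ins for the two datasets: (origin, example id) pairs.
manual = [("manual", i) for i in range(50)]
generated = [("generated", i) for i in range(5000)]

epoch = mix_datasets(manual, generated, manual_fraction=0.3, n_samples=1000)
n_manual = sum(1 for origin, _ in epoch if origin == "manual")
print(f"{n_manual} of {len(epoch)} examples come from the manual set")
```

The alternative I describe above, sequential fine-tuning, would instead train on `generated` first and then continue on `manual` only, probably with a lower learning rate.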