What model and approach should I use for my use case?

Hi everyone! I am working on a project in the e-publication domain. The objective is to identify the elements of an input document (my inputs are research articles): the model should correctly identify which part is the title, author name, abstract, keywords, etc. Which approach can I use for this: NER, or something else? And how should I build my dataset? Can anyone guide me on this?

If you have some training data, that is, documents with their elements already identified, you could try fine-tuning an LLM to predict the elements when fed the documents.
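As a rough idea of how such training data could be laid out, here is a minimal sketch that turns one labelled document into a prompt/completion pair and serializes the pairs as JSONL. The record layout, the `###` separator, the `END` marker and the function names are assumptions for illustration; the format you actually need depends on the model and fine-tuning framework you pick.

```python
import json

def make_example(document_text, metadata, max_chars=4000):
    """Build one prompt/completion pair from a labelled document.

    metadata is a list of (field, value) pairs, so repeated fields
    (for example several authors) are kept in order.
    """
    completion = "\n".join(f"{field}: {value}" for field, value in metadata)
    return {
        # Hypothetical separator marking the end of the document text.
        "prompt": document_text[:max_chars].strip() + "\n\n###\n\n",
        # Hypothetical stop marker so the model learns where to stop.
        "completion": " " + completion + "\nEND",
    }

def to_jsonl(examples):
    """Serialize examples as JSON Lines, one record per line."""
    return "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples)

example = make_example(
    "Role of Urinary Findings ... (first pages of the document)",
    [("Title", "Role of Urinary Findings"), ("Author", "Mantula, Paula")],
)
print(to_jsonl([example]))
```

Only the beginning of each document is used here, on the assumption that title, authors, abstract and keywords usually appear on the first pages.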

We have tried that in this notebook with OpenAI GPT3. In the example below, the curated metadata are the ground-truth data and the generated metadata are the output of the fine-tuned model:

Curated metadata:
Author: Mantula, Paula
Supervisor: Docent Satu Mäkelä, Tampere University
Supervisor: Professor Emeritus Jukka Mustonen, Tampere University
Faculty: Lääketieteen ja terveysteknologian tiedekunta - Faculty of Medicine and Health Technology
Opponent: Professor Risto Tertti, University of Turku
Organization: Tampere University
Reviewer: Docent Mari Kanerva , University of Helsinki
Reviewer: Docent Risto Ikäheimo, University of Oulu
Issued: 2021-04-16
ISBN (online): 978-952-03-1880-2
URN: URN:ISBN:978-952-03-1880-2
Language: eng
Publisher: Tampere University
ISSN (online): 2490-0028
ISBN (printed): 978­952­03­1879­6
Series name: Tampere University Dissertations - Tampereen yliopiston väitöskirjat
Number in series: 385
ISSN (printed): 2489-9860
Degree program: Lääketieteen, biotieteiden ja biolääketieteen tekniikan tohtoriohjelma - Doctoral Programme in Medicine, Biosciences and Biomedical Engineering
Title: Role of Urinary Findings and Adipokines in Puumala Virus-induced Acute Kidney Injury
COAR type: doctoral thesis
OKM type: G5 Artikkeliväitöskirja
Thesis level: Artikkeliväitöskirja
---
Generated metadata:

Author: Mantula, Paula
Supervisor: Docent Satu Mäkelä, Tampere University
Faculty: Lääketieteen ja terveysteknologian tiedekunta - Faculty of Medicine and Health Sciences
Opponent: Professor Risto Tertti, University of Turku
Organization: Tampere University
Reviewer: Docent Risto Ikäheimo, Docent Mari Kanerva, University of Oulu
Reviewer: Professor Emeritus Jukka Mustonen, Tampere University
Issued: 2021-03-04
ISBN (online): 978-952-03-1880-2
URN: URN:ISBN:978-952-03-1880-2
Language: eng
Publisher: Tampere University
ISSN (online): 2490-0028
ISBN (printed): 978-952-03-1879-6
Series name: Tampere University Dissertations - Tampereen yliopiston väitöskirjat
Number in series: 385
ISSN (printed): 2489-9860
Degree program: Farmakologian, lääketieteen ja biotieteiden tohtoriohjelma - Doctoral Programme in Medicinal and Clinical Biochemistry
Title: Role of Urinary Findings and Adipokines in Puumala Virus-induced Acute Kidney Injury
COAR type: doctoral thesis
OKM type: G4 Monografiaväitöskirja
Thesis level: Monografiaväitöskirja
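
To compare two listings like the ones above field by field, something along these lines could be used. It is a minimal sketch that assumes one `Field: value` pair per line and counts only exact matches; it is not necessarily how the repo computes its results, and the function names are made up for illustration.

```python
from collections import Counter

def parse_metadata(text):
    """Parse 'Field: value' lines into a multiset of (field, value) pairs."""
    pairs = Counter()
    for line in text.splitlines():
        # Split at the first ': ' only, so values such as
        # 'URN:ISBN:...' stay intact.
        field, sep, value = line.partition(": ")
        if sep and value.strip():
            pairs[(field.strip(), value.strip())] += 1
    return pairs

def score(curated_text, generated_text):
    """Return exact-match (precision, recall, f1) over field/value pairs."""
    curated = parse_metadata(curated_text)
    generated = parse_metadata(generated_text)
    # Multiset intersection: a pair counts only as often as it occurs in both.
    matched = sum((curated & generated).values())
    precision = matched / sum(generated.values()) if generated else 0.0
    recall = matched / sum(curated.values()) if curated else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1
```

Exact matching is strict: a faculty name that differs in a single word counts as a miss, so a softer string similarity may be more informative for long fields.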

The repo also contains experiments with other LLMs. The results table shows that even small, open LLMs can produce results as good as or even better than GPT3.

There is also a notebook that uses only prompting: GPT 3/4 is asked to extract metadata from the documents. While this produces quite good results, it is really slow and expensive.
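
For the prompting-only route, the prompt could be assembled roughly like this. The wording, the field list and the function name are assumptions for illustration, not the prompt used in the notebook, and the call to the model itself is left out.

```python
def build_prompt(document_text, fields, max_chars=6000):
    """Assemble an extraction prompt for a chat or completion model."""
    field_list = "\n".join(f"- {name}" for name in fields)
    return (
        "Extract the following metadata fields from the document below.\n"
        "Answer with one 'Field: value' line per value and omit fields "
        "that are not present.\n\n"
        f"Fields:\n{field_list}\n\n"
        # Truncate long documents to keep the request small.
        f"Document:\n{document_text[:max_chars]}"
    )

prompt = build_prompt(
    "text of the first pages ...",
    ["Title", "Author", "Abstract", "Keywords"],
)
print(prompt)
```

Since every document costs one request with the whole text in the prompt, this is where the slowness and expense come from.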

That said, if all your documents are research articles, maybe papermage could work; see a demo.