What model and approach should I use for my use case?

Banuchander · May 20, 2024, 11:35am

Hi everyone! I am working on a project in the e-publishing domain. The objective is to identify the elements of an input document (the inputs are research articles): the model should correctly label which part is the title, author name, abstract, keywords, and so on. Which approach should I use: NER or something else? And how should I build my dataset? Can anyone guide me on this?

juhoinkinen · May 20, 2024, 5:01pm

If you have some training data, that is, documents with their elements already identified, you could try fine-tuning an LLM to predict the elements when fed the documents.
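As a rough sketch of what such training data could look like, the snippet below turns (document text, curated metadata) pairs into prompt/completion records in JSONL. The record keys `prompt` and `completion`, the `###` separator, the helper names, and the placeholder document text are all assumptions made for illustration, not a format prescribed by any particular fine-tuning API.

```python
import json

def format_metadata(fields):
    """Render (key, value) pairs as 'Key: value' lines; repeated keys are allowed."""
    return "\n".join(f"{key}: {value}" for key, value in fields)

def make_record(document_text, fields, max_chars=4000):
    """Build one training example from a document's text and its curated metadata.

    max_chars is an assumed budget: only the beginning of the document is kept.
    """
    return {
        "prompt": document_text[:max_chars].strip() + "\n\n###\n\n",
        "completion": format_metadata(fields),
    }

def to_jsonl(examples):
    """Serialize (document_text, fields) pairs, one JSON record per line."""
    return "\n".join(
        json.dumps(make_record(text, fields), ensure_ascii=False)
        for text, fields in examples
    )

# Hypothetical example: the document text is a placeholder, the metadata
# values are taken from the curated record in this thread.
examples = [
    (
        "(full text of the thesis would go here)",
        [("Author", "Mantula, Paula"), ("Language", "eng")],
    ),
]
jsonl = to_jsonl(examples)
```

Keeping the metadata as a list of pairs rather than a dict lets a document have several values for the same element, such as multiple supervisors or reviewers.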

We have tried that in this notebook with OpenAI GPT-3; here the curated metadata are the ground-truth data and the generated metadata are the output of the fine-tuned model:

Curated metadata:
Author: Mantula, Paula
Supervisor: Docent Satu Mäkelä, Tampere University
Supervisor: Professor Emeritus Jukka Mustonen, Tampere University
Faculty: Lääketieteen ja terveysteknologian tiedekunta - Faculty of Medicine and Health Technology
Opponent: Professor Risto Tertti, University of Turku
Organization: Tampere University
Reviewer: Docent Mari Kanerva, University of Helsinki
Reviewer: Docent Risto Ikäheimo, University of Oulu
Issued: 2021-04-16
ISBN (online): 978-952-03-1880-2
URN: URN:ISBN:978-952-03-1880-2
Language: eng
Publisher: Tampere University
ISSN (online): 2490-0028
ISBN (printed): 9789520318796
Series name: Tampere University Dissertations - Tampereen yliopiston väitöskirjat
Number in series: 385
ISSN (printed): 2489-9860
Degree program: Lääketieteen, biotieteiden ja biolääketieteen tekniikan tohtoriohjelma - Doctoral Programme in Medicine, Biosciences and Biomedical Engineering
Title: Role of Urinary Findings and Adipokines in Puumala Virus-induced Acute Kidney Injury
COAR type: doctoral thesis
OKM type: G5 Artikkeliväitöskirja
Thesis level: Artikkeliväitöskirja
---
Generated metadata:

Author: Mantula, Paula
Supervisor: Docent Satu Mäkelä, Tampere University
Faculty: Lääketieteen ja terveysteknologian tiedekunta - Faculty of Medicine and Health Sciences
Opponent: Professor Risto Tertti, University of Turku
Organization: Tampere University
Reviewer: Docent Risto Ikäheimo, Docent Mari Kanerva, University of Oulu
Reviewer: Professor Emeritus Jukka Mustonen, Tampere University
Issued: 2021-03-04
ISBN (online): 978-952-03-1880-2
URN: URN:ISBN:978-952-03-1880-2
Language: eng
Publisher: Tampere University
ISSN (online): 2490-0028
ISBN (printed): 978-952-03-1879-6
Series name: Tampere University Dissertations - Tampereen yliopiston väitöskirjat
Number in series: 385
ISSN (printed): 2489-9860
Degree program: Farmakologian, lääketieteen ja biotieteiden tohtoriohjelma - Doctoral Programme in Medicinal and Clinical Biochemistry
Title: Role of Urinary Findings and Adipokines in Puumala Virus-induced Acute Kidney Injury
COAR type: doctoral thesis
OKM type: G4 Monografiaväitöskirja
Thesis level: Monografiaväitöskirja
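One way to put a number on how close the generated metadata are to the curated metadata is to compare them as sets of "Key: value" lines. The following is a minimal sketch under that assumption; the function names are hypothetical and exact string matching is a deliberately strict choice.

```python
from collections import Counter

def parse_metadata(text):
    """Parse 'Key: value' lines into a multiset of (key, value) pairs."""
    pairs = Counter()
    for line in text.splitlines():
        # Split at the first ': ' only, so values like 'URN:ISBN:...' stay intact.
        key, sep, value = line.partition(": ")
        if sep and value.strip():
            pairs[(key.strip(), value.strip())] += 1
    return pairs

def score(curated_text, generated_text):
    """Return (precision, recall) of generated pairs against curated pairs."""
    curated = parse_metadata(curated_text)
    generated = parse_metadata(generated_text)
    matched = sum((curated & generated).values())  # multiset intersection
    precision = matched / sum(generated.values()) if generated else 0.0
    recall = matched / sum(curated.values()) if curated else 0.0
    return precision, recall

# Three of the fields from the example above: only 'Issued' differs.
curated = "Author: Mantula, Paula\nIssued: 2021-04-16\nLanguage: eng"
generated = "Author: Mantula, Paula\nIssued: 2021-03-04\nLanguage: eng"
precision, recall = score(curated, generated)
```

With exact matching, the printed ISBN above counts as an error even though `9789520318796` and `978-952-03-1879-6` are the same number, so normalizing values (for example, stripping hyphens from ISBNs) before comparing may give a fairer picture.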

The repo also contains experiments with other LLMs. The results table shows that small, open LLMs can also produce results as good as or even better than GPT-3.

There is also a notebook that uses just prompting: GPT-3/4 is asked to extract metadata from documents. While this produces quite good results, it is really slow and expensive.
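To illustrate the prompting approach, here is a small sketch of how such an extraction prompt might be assembled. The field list, the wording of the instructions, and the `max_chars` cut-off are assumptions for illustration; the cut-off rests on the further assumption that most metadata appears near the start of a document, which keeps the prompt shorter and cheaper. Sending the prompt to a model is left out on purpose.

```python
# Hypothetical field list; adapt it to the elements you need.
FIELDS = ["Title", "Author", "Issued", "Language", "Publisher"]

def build_prompt(document_text, fields=FIELDS, max_chars=3000):
    """Compose an extraction prompt from the first max_chars characters of a document."""
    field_list = "\n".join(f"- {name}" for name in fields)
    return (
        "Extract the following metadata fields from the document below.\n"
        "Answer with one 'Field: value' line per field "
        "and skip fields that are not present.\n\n"
        f"Fields:\n{field_list}\n\n"
        f"Document:\n{document_text[:max_chars]}"
    )

prompt = build_prompt("(text of the first pages would go here)")
```

Asking for one "Field: value" line per field keeps the answer in the same shape as the curated metadata, which makes the two easy to compare.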

juhoinkinen · May 20, 2024, 5:06pm

But actually, if all your documents are research articles, maybe papermage could work; see a demo.
