Hi all… I am currently trying to build a text summarization and classification NLP project for research papers. I went through the tutorials on HuggingFace, but I still find myself lost and stuck. So far, I have downloaded a few papers that I intend to use for training and testing a model. The issue I am facing now is: how do I form a dataset out of these papers and load it into the model that I want to use? Any advice is greatly appreciated. Thank you!!
There are probably countless methods, including manual ones, but I don’t think there is yet a single established method…
It’s a topic that could be the subject of research in itself, and I don’t think it’s unreasonable to get stuck there…
There are many papers on methods, and there also seems to be a publicly available dataset for scientific paper classification tasks.
https://www.nature.com/articles/s41467-024-45914-8
Hi, @MrTehIced !
I appreciate your activity. If you want to develop your own model using your own data, you can use code along these lines to extract the text from each PDF.
import PyPDF2

def extract_text_from_pdf(pdf_path):
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ''.join(page.extract_text() for page in reader.pages)
    return text
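Text extracted from PDFs often comes back with words hyphenated across line breaks and with irregular whitespace. Here is a minimal cleanup sketch that could be applied to the extracted string before building the dataset. The helper name `clean_extracted_text` is hypothetical (it is not part of PyPDF2), and the rules are assumptions that may need tuning for your papers.

```python
import re


def clean_extracted_text(raw: str) -> str:
    """Normalize text pulled out of a PDF (hypothetical helper, not a PyPDF2 API)."""
    # Rejoin words split by a hyphen at a line break: "summari-\nzation" -> "summarization".
    # Caveat: this also merges genuine compounds that happen to break at a hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Treat blank lines as paragraph breaks, then collapse whitespace inside each paragraph.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = [" ".join(p.split()) for p in paragraphs]
    # Drop empty paragraphs and separate the rest with a single blank line.
    return "\n\n".join(p for p in cleaned if p)
```

It would be called on the result of the extraction step, for example `clean_extracted_text(extract_text_from_pdf(pdf_path))`.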
The data should then be in this format:
[
    {
        "text": "Full text of the paper...",
        "summary": "Paper summary...",
        "label": "Topic label..."
    }
]
You can load your data like this:
from datasets import Dataset

data = [
    {"text": "Full text of the paper 1...", "summary": "Summary 1...", "label": "Label 1"},
    {"text": "Full text of the paper 2...", "summary": "Summary 2...", "label": "Label 2"}
]
dataset = Dataset.from_list(data)
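To get from a folder of papers to that list of records, one possible approach is sketched below. It assumes a hypothetical layout that is not from this thread: a directory with one `.txt` file per paper (the extracted text) and a `metadata.csv` with the columns `filename`, `summary` and `label` that you fill in yourself. The function names are illustrative only.

```python
import csv
import json
from pathlib import Path


def build_records(text_dir, metadata_csv):
    """Pair each extracted paper text with its summary and label.

    Assumed (hypothetical) layout: text_dir holds one .txt file per paper,
    and metadata_csv has the columns filename, summary, label.
    """
    records = []
    with open(metadata_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            text_path = Path(text_dir) / row["filename"]
            if not text_path.exists():
                continue  # skip papers whose text was never extracted
            records.append({
                "text": text_path.read_text(encoding="utf-8"),
                "summary": row["summary"],
                "label": row["label"],
            })
    return records


def save_jsonl(records, out_path):
    """Write one JSON object per line (the JSON Lines format)."""
    with open(out_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

The returned list has the same shape as `data` above, so it can be passed to `Dataset.from_list(records)`. A file written by `save_jsonl` can also be loaded later with `load_dataset("json", data_files="papers.jsonl")`, where the file name is just an example.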
You can use pretrained models such as t5-small or bart-large-cnn. But if you want better results on research papers, how about using a summarization model trained on arXiv papers? You can search for it on the Hugging Face Hub. I don’t remember the exact name, but its context size is more than 16800.
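Context size matters because a full paper is usually much longer than what small models such as t5-small or bart-large-cnn accept as input. If you stay with one of those, a common workaround is to split each paper into overlapping chunks and summarize the chunks separately. This is only a rough sketch: the function name and the default sizes are assumptions, and counting words only approximates the token count the model actually sees.

```python
def chunk_words(text, max_words=400, overlap=50):
    """Split text into overlapping word windows (word counts only approximate token counts)."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reaches the end of the text
    return chunks
```

Each chunk can then be stored as its own `text` entry or summarized on its own, with the partial summaries combined afterwards.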
That said, if I were you, I would not fine-tune a model. I would use the OpenAI API instead of fine-tuning.
You can also use OpenAI and LangChain to chat with your data and get summaries. This project will help you.
Hey Alan! Thank you for your reply. Regarding the data format, how do I convert my papers into the format that you have listed above? I tried many ways, including exporting to TXT, JSON, and CSV, but I am not sure how to fit the result into the format that you have shown.
Thank you!
Hi John,
Thank you for the reply and providing the research journal. I will read it as soon as I can.
Thank you!