NLP for Summarization and Classification

Hi all… I am currently trying to build a text summarization and classification NLP project for research papers. I went through the tutorials on HuggingFace, but I still find myself very lost and stuck. So far, I have downloaded a few papers that I intend to use for training and testing a model. The issue I am facing now is: how do I form a dataset out of these papers and load it into the model I want to use? Any advice is greatly appreciated. Thank you!!


There are probably countless methods, including manual ones, but I don’t think there is a single established one yet…
It’s a topic that could be the subject of research in itself, so I don’t think it’s unreasonable to get stuck there…
There are many papers on possible methods, and there also seems to be a publicly available dataset for scientific paper classification tasks.
https://www.nature.com/articles/s41467-024-45914-8

Hi, @MrTehIced!
I appreciate your effort. If you want to develop your own model using your own data, you can use code like this to extract the text from each paper:

import PyPDF2

def extract_text_from_pdf(pdf_path):
    # Open the PDF in binary mode and read it with PyPDF2
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        # Join the extracted text of every page; extract_text() may return
        # an empty string for scanned or image-only pages
        text = '\n'.join(page.extract_text() or '' for page in reader.pages)
    return text
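
For example, with one of your downloaded papers (the file name here is just a placeholder):

text = extract_text_from_pdf("paper1.pdf")
print(text[:500])  # inspect the first 500 characters to check extraction quality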

And each record in your dataset should look like this:

[
    {
        "text": "Full text of the paper...",
        "summary": "Paper summary...",
        "label": "Topic label..."
    }
]
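
A rough way to build records in that shape from a folder of downloaded PDFs (the folder name is just an assumption, and the summary/label fields have to come from somewhere, e.g. the paper’s own abstract and a topic you assign by hand):

import os

pdf_dir = "papers"  # hypothetical folder containing your downloaded papers

data = []
for filename in os.listdir(pdf_dir):
    if not filename.endswith(".pdf"):
        continue
    text = extract_text_from_pdf(os.path.join(pdf_dir, filename))
    data.append({
        "text": text,
        "summary": "",  # fill in by hand, or reuse the paper's abstract
        "label": "",    # fill in by hand, e.g. the paper's topic/field
    })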

Then you can load your data with the datasets library:

from datasets import Dataset

# Build a Hugging Face Dataset directly from a list of dictionaries
data = [
    {"text": "Full text of the paper 1...", "summary": "Summary 1...", "label": "Label 1"},
    {"text": "Full text of the paper 2...", "summary": "Summary 2...", "label": "Label 2"}
]
dataset = Dataset.from_list(data)
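
If it helps, you can also split it into train/test sets and save it to disk so you don’t have to re-extract the PDFs every time:

# 80/20 train-test split, then save for reuse
splits = dataset.train_test_split(test_size=0.2, seed=42)
splits.save_to_disk("papers_dataset")
print(splits)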

You can use pretrained models such as t5-small or bart-large-cnn. But if you want better results on research papers, how about using an arXiv summarization model? You can search for that model on the Hugging Face Hub; I don’t remember the exact name, but its context size is more than 16,800 tokens.
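
For a quick first test with one of those baselines, the transformers summarization pipeline is enough (note that bart-large-cnn only accepts about 1,024 input tokens, so a full paper has to be truncated or chunked; the slicing below is just a crude workaround):

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Crude truncation of the paper text so it fits the model's input limit
summary = summarizer(dataset[0]["text"][:3000], max_length=150, min_length=40, do_sample=False)
print(summary[0]["summary_text"])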

And if I were you, I would not fine-tune the model; I would use the OpenAI API instead of fine-tuning.
You can also use OpenAI together with LangChain to chat with your data and get summaries. That should help with this project.
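
A minimal sketch of that approach with the OpenAI Python SDK (the model name, prompt, and file name are just examples, not recommendations, and it needs OPENAI_API_KEY set in your environment):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

paper_text = extract_text_from_pdf("paper1.pdf")  # hypothetical file name

response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name
    messages=[
        {"role": "system", "content": "You summarize research papers."},
        {"role": "user", "content": "Summarize this paper in one short paragraph:\n\n" + paper_text[:10000]},
    ],
)
print(response.choices[0].message.content)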


Hey Alan! Thank you for your reply. Regarding the data format, how do I convert my papers into the structure you listed above? I tried many approaches, exporting to TXT, JSON, and CSV, but I am not sure how to fit them into the format you have shown.

Thank you!


Hi John,

Thank you for the reply and for providing the journal article. I will read it as soon as I can.

Thank you!
