NLP for Summarization and Classification

Hi all… I am currently trying to build a text summarization and classification NLP project for research papers. I went through the tutorials on HuggingFace, but I still find myself very lost and stuck. So far, I have downloaded a few papers that I intend to use for training and testing a model. The issue I am facing now is: how do I form a dataset out of these papers and load it into the model I want to use? Any advice is greatly appreciated. Thank you!!


There are probably countless methods, including manual ones, but I don’t think there is a single established one yet…
It’s a topic that could be the subject of research in itself, so I don’t think it’s unreasonable to get stuck there…
There are many papers on possible methods, and there also seems to be a publicly available dataset for scientific paper classification tasks.
https://www.nature.com/articles/s41467-024-45914-8

Hi, @MrTehIced!
I appreciate your effort. If you want to develop your own model using your own data, you can use code like this to extract the text from each paper:

import PyPDF2

def extract_text_from_pdf(pdf_path):
    # Open the PDF in binary mode and read it with PyPDF2
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        # Join the extracted text of every page; extract_text() may return
        # an empty string for scanned or image-only pages
        text = '\n'.join(page.extract_text() or '' for page in reader.pages)
    return text
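
For example, with one of your downloaded papers (the file name here is just a placeholder):

text = extract_text_from_pdf("paper1.pdf")
print(text[:500])  # inspect the first 500 characters to check extraction quality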

And each record in your dataset should look like this:

[
    {
        "text": "Full text of the paper...",
        "summary": "Paper summary...",
        "label": "Topic label..."
    }
]
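
A rough way to build records in that shape from a folder of downloaded PDFs (the folder name is just an assumption, and the summary/label fields have to come from somewhere, e.g. the paper’s own abstract and a topic you assign by hand):

import os

pdf_dir = "papers"  # hypothetical folder containing your downloaded papers

data = []
for filename in os.listdir(pdf_dir):
    if not filename.endswith(".pdf"):
        continue
    text = extract_text_from_pdf(os.path.join(pdf_dir, filename))
    data.append({
        "text": text,
        "summary": "",  # fill in by hand, or reuse the paper's abstract
        "label": "",    # fill in by hand, e.g. the paper's topic/field
    })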

Then you can load your data with the datasets library:

from datasets import Dataset

# Build a Hugging Face Dataset directly from a list of dictionaries
data = [
    {"text": "Full text of the paper 1...", "summary": "Summary 1...", "label": "Label 1"},
    {"text": "Full text of the paper 2...", "summary": "Summary 2...", "label": "Label 2"}
]
dataset = Dataset.from_list(data)
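
If it helps, you can also split it into train/test sets and save it to disk so you don’t have to re-extract the PDFs every time:

# 80/20 train-test split, then save for reuse
splits = dataset.train_test_split(test_size=0.2, seed=42)
splits.save_to_disk("papers_dataset")
print(splits)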

You can use pretrained models such as t5-small or bart-large-cnn. But if you want better results on research papers, how about using an arXiv summarization model? You can search for that model on the Hugging Face Hub; I don’t remember the exact name, but its context size is more than 16,800 tokens.
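
For a quick first test with one of those baselines, the transformers summarization pipeline is enough (note that bart-large-cnn only accepts about 1,024 input tokens, so a full paper has to be truncated or chunked; the slicing below is just a crude workaround):

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Crude truncation of the paper text so it fits the model's input limit
summary = summarizer(dataset[0]["text"][:3000], max_length=150, min_length=40, do_sample=False)
print(summary[0]["summary_text"])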

And if I were you, I would not fine-tune the model; I would use the OpenAI API instead of fine-tuning.
You can also use OpenAI together with LangChain to chat with your data and get summaries. That should help with this project.
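
A minimal sketch of that approach with the OpenAI Python SDK (the model name, prompt, and file name are just examples, not recommendations, and it needs OPENAI_API_KEY set in your environment):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

paper_text = extract_text_from_pdf("paper1.pdf")  # hypothetical file name

response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name
    messages=[
        {"role": "system", "content": "You summarize research papers."},
        {"role": "user", "content": "Summarize this paper in one short paragraph:\n\n" + paper_text[:10000]},
    ],
)
print(response.choices[0].message.content)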


Hey Alan! Thank you for your reply. Regarding the data format, how do I convert my papers into the structure you listed above? I tried many approaches, exporting to TXT, JSON, and CSV, but I am not sure how to fit them into the format you have shown.

Thank you!


Hi John,

Thank you for the reply and for providing the journal article. I will read it as soon as I can.

Thank you!
