A new dataset for multi-label text classification

sayakpaul · September 30, 2021, 9:04am

Soumik and I are pleased to share a new NLP dataset for multi-label text classification. The dataset consists of paper titles, abstracts, and term categories scraped from arXiv. Find the dataset on Kaggle: arXiv Paper Abstracts | Kaggle.

We are also releasing our data collection pipeline, which is based on Apache Beam. It can be run at scale on Cloud Dataflow (GCP) and can be used to collect an even bigger dataset with ease. To help the community get started quickly, we have authored this blog post that shows how to build a simple baseline model on a smaller version of the dataset.
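For anyone new to multi-label setups, here is a minimal sketch of how a paper's term categories could be turned into multi-hot targets before training a baseline. The sample records, the `terms` field name, and the helper functions are assumptions for illustration; they are not taken from the dataset files or the blog post.

```python
from typing import Dict, List

# Hypothetical sample records; titles and category lists are made up for illustration.
records = [
    {"title": "Paper A", "terms": ["cs.CV", "cs.LG"]},
    {"title": "Paper B", "terms": ["stat.ML", "cs.LG"]},
    {"title": "Paper C", "terms": ["cs.CL"]},
]


def build_vocab(rows: List[dict]) -> Dict[str, int]:
    """Map every distinct category to a column index (sorted for determinism)."""
    labels = sorted({term for row in rows for term in row["terms"]})
    return {label: index for index, label in enumerate(labels)}


def multi_hot(terms: List[str], vocab: Dict[str, int]) -> List[int]:
    """Return a 0/1 vector with a 1 in the column of each category the paper has."""
    vector = [0] * len(vocab)
    for term in terms:
        vector[vocab[term]] = 1
    return vector


vocab = build_vocab(records)
targets = [multi_hot(row["terms"], vocab) for row in records]

for row, target in zip(records, targets):
    print(row["title"], target)
```

Because a paper can belong to several categories at once, each target column is typically treated as an independent binary decision (for example, a sigmoid output per category) rather than a single softmax over all categories.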

More details are here: GitHub - soumik12345/multi-label-text-classification

nielsr · September 30, 2021, 9:30am

Cool! It would be great if you uploaded this dataset to the Hub. Here's a guide: Share — datasets 1.12.1 documentation
