A new dataset for multi-label text classification

Soumik and I are pleased to share a new NLP dataset for multi-label text classification. The dataset consists of paper titles, abstracts, and term categories scraped from arXiv. Find the dataset on Kaggle: arXiv Paper Abstracts | Kaggle.

We are also releasing our data collection pipeline. It is based on Apache Beam and can run at scale on Cloud Dataflow (GCP), so it can be used to accumulate an even bigger dataset with ease. To help the community get started quickly, we have authored this blog post, which shows how to build a simple baseline model on a smaller version of the dataset.
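For anyone new to the multi-label setup: each paper can carry several categories at once, so the targets are usually encoded as multi-hot vectors rather than a single class index. Here is a minimal sketch of that encoding in plain Python. The `multi_hot` helper and the sample label lists are hypothetical and only for illustration; the blog post's baseline may prepare the labels differently.

```python
def multi_hot(label_sets):
    """Build a sorted label vocabulary and one 0/1 vector per example."""
    # Collect every distinct label seen across all examples.
    vocab = sorted({label for labels in label_sets for label in labels})
    index = {label: i for i, label in enumerate(vocab)}

    vectors = []
    for labels in label_sets:
        row = [0] * len(vocab)
        for label in labels:
            row[index[label]] = 1  # mark each category the paper belongs to
        vectors.append(row)
    return vocab, vectors


# Hypothetical category lists for three papers (not taken from the dataset).
papers = [
    ["cs.CV", "cs.LG"],
    ["cs.LG", "stat.ML"],
    ["cs.CV"],
]

vocab, targets = multi_hot(papers)
print(vocab)
for row in targets:
    print(row)
```

Each position in a target vector corresponds to one category in `vocab`, so a model can predict every category independently, for example with one sigmoid output per category.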

More details are here: GitHub - soumik12345/multi-label-text-classification

Cool! It would be great if you uploaded this dataset to the Hub :slight_smile: Here’s a guide: Share — datasets 1.12.1 documentation
