Classifying entire dataset with IDs

JFairbairn · June 16, 2023, 12:25pm

I have created a large on-disk dataset of sentences that I want to classify using a two-label classifier. I’ve reduced the dataset to two columns, “sentence_id” and “text”. What I want is an output file of sentence IDs, each with the corresponding label that the model generates.

I could obviously loop over each row manually, but my understanding is that it’s best practice to use the iterator approach when calling the pipeline. With that approach, however, you don’t have access to the sentence ID. The documentation says:

The iterator data() yields each result, and the pipeline automatically recognizes the input is iterable and will start fetching the data while it continues to process it on the GPU (this uses DataLoader under the hood). This is important because you don’t have to allocate memory for the whole dataset and you can feed the GPU as fast as possible.

Is there a way to get the classification result for every row in the dataset as a table or as key-value pairs that include the sentence ID? Thank you.

from datasets import load_from_disk
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset  # documented import path; not used below


ds = load_from_disk("dataset.arrow")

def data():
    for row in ds:
        yield row["text"]

pipe = pipeline(model="a_classifier_model")

for result in pipe(data()):
    print(result)

mariosasko · June 19, 2023, 4:08pm

Hi! You can zip the pipeline output with the input samples since the pipeline preserves the input order.

For a code example, see the Stack Overflow question “Getting the input text from transformers pipeline”.
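
A minimal sketch of the zip approach, assuming the pipeline yields exactly one result per input text, in input order. `classify_with_ids` and `fake_pipe` are hypothetical names; `fake_pipe` only stands in for the real pipeline so the example runs without downloading a model.

```python
from typing import Callable, Iterable, Iterator, Tuple


def classify_with_ids(
    rows: Iterable[dict],
    classify: Callable[[Iterable[str]], Iterable[dict]],
) -> Iterator[Tuple[object, dict]]:
    """Yield (sentence_id, result) pairs.

    Assumes `classify` yields one result per input text, in input order,
    and that `rows` can be iterated more than once (a generator cannot).
    """
    texts = (row["text"] for row in rows)  # lazy, so nothing is copied up front
    for row, result in zip(rows, classify(texts)):
        yield row["sentence_id"], result


def fake_pipe(texts: Iterable[str]) -> Iterator[dict]:
    """Hypothetical stand-in for a text-classification pipeline."""
    for text in texts:
        yield {"label": "LONG" if len(text) > 10 else "SHORT", "score": 1.0}


rows = [
    {"sentence_id": "s1", "text": "Hello."},
    {"sentence_id": "s2", "text": "A somewhat longer sentence."},
]
labels = {sid: result["label"] for sid, result in classify_with_ids(rows, fake_pipe)}
print(labels)  # {'s1': 'SHORT', 's2': 'LONG'}
```

With the objects from the question, `classify_with_ids(ds, pipe)` should behave the same way, on the assumption that `pipe` returns an iterator of results when it is given a generator. Each pair can then be written to the output file as it arrives rather than being collected in memory.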
