Classifying entire dataset with IDs

I have created a large on-disk dataset of sentences that I want to classify with a two-label classifier. I’ve reduced the dataset to two columns: “sentence_id” and “text”. My goal is an output file that pairs each sentence ID with the label the model generates.

I could obviously loop over each row manually, but my understanding is that the best practice is to pass an iterator to the pipeline. With that approach, though, you no longer have access to the sentence ID. The documentation says:

The iterator data() yields each result, and the pipeline automatically recognizes the input is iterable and will start fetching the data while it continues to process it on the GPU (this uses DataLoader under the hood). This is important because you don’t have to allocate memory for the whole dataset and you can feed the GPU as fast as possible.

Is there a way to get the classification result for every row in the dataset as a table or as key-value pairs that include the sentence ID? Thank you

from datasets import load_from_disk
from transformers import pipeline
from transformers.pipelines.base import KeyDataset


ds = load_from_disk("dataset.arrow")

def data():
    for row in ds:
        yield row["text"]

pipe = pipeline(model="a_classifier_model")

for result in pipe(data()):
    print(result)

Hi! You can zip the pipeline output with the input samples since the pipeline preserves the input order.

For a code example, see the Stack Overflow question “Getting the input text from transformers pipeline”.
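Here is a minimal sketch of the zip idea applied to the question's setup. It is not taken from the linked answer. The helper name `classify_with_ids` and its parameters are hypothetical, and it assumes the pipeline returns exactly one result per input text, in input order, and that the dataset can be iterated more than once (true for a `Dataset` loaded with `load_from_disk`, or for a list).

```python
from typing import Any, Callable, Iterable, Iterator, Mapping, Tuple


def classify_with_ids(
    pipe: Callable[[Iterator[str]], Iterable[Any]],
    rows: Iterable[Mapping[str, Any]],
    id_key: str = "sentence_id",
    text_key: str = "text",
) -> Iterator[Tuple[Any, Any]]:
    """Yield (sentence_id, prediction) pairs.

    Hypothetical helper. Assumes `pipe` yields one result per input text,
    in input order, and that `rows` can be iterated twice.
    """
    texts = (row[text_key] for row in rows)  # lazy stream fed to the pipeline
    ids = (row[id_key] for row in rows)      # second lazy pass, IDs only
    # zip pairs the n-th ID with the n-th pipeline result.
    yield from zip(ids, pipe(texts))


# Usage with the objects from the question (not executed here):
#     for sentence_id, result in classify_with_ids(pipe, ds):
#         print(sentence_id, result)
```

Both passes are lazy, so the whole dataset never has to sit in memory, at the cost of reading each row twice. If the ID column fits in memory, reading it once up front and zipping it with the pipeline output should work as well.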
