I have created a large on-disk dataset of sentences that I want to classify with a two-label classifier. I’ve reduced the dataset to two columns: “sentence_id” and “text”. My goal is an output file of sentence IDs, each with the label the model generates for it.
I could obviously loop over each row manually, but my understanding is that it’s best practice to pass an iterator to the pipeline. With that approach, though, you don’t have access to the sentence ID. The documentation says:
data() yields each result, and the pipeline automatically recognizes the input is iterable and will start fetching the data while it continues to process it on the GPU (this uses DataLoader under the hood). This is important because you don’t have to allocate memory for the whole dataset and you can feed the GPU as fast as possible.
Is there a way to get the classification result for every row in the dataset, as a table or as key-value pairs that include the sentence ID? Thank you.
from datasets import load_from_disk
from transformers import pipeline
from transformers.pipelines.base import KeyDataset

ds = load_from_disk("dataset.arrow")

def data():
    for row in ds:
        yield row["text"]

pipe = pipeline(model="a_classifier_model")

for result in pipe(data()):
    print(result)
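One possible way to keep the IDs is sketched below. It is not a confirmed library feature. It assumes that the pipeline yields exactly one result per input, in input order, and that the dataset can be iterated twice. The helper name classify_with_ids and the stand-in fake_pipe are hypothetical, introduced only so the example runs without downloading a model.

```python
from typing import Any, Callable, Dict, Iterable, Iterator


def classify_with_ids(
    rows: Iterable[Dict[str, Any]],
    pipe: Callable[[Iterator[str]], Iterable[Any]],
    id_key: str = "sentence_id",
    text_key: str = "text",
) -> Iterator[Dict[str, Any]]:
    """Pair each row's ID with the result the pipeline yields for its text.

    Assumes `pipe` yields exactly one result per input, in input order,
    and that `rows` can be iterated more than once (a list or Dataset can,
    a one-shot generator cannot).
    """
    ids = (row[id_key] for row in rows)      # first lazy pass: IDs only
    texts = (row[text_key] for row in rows)  # second lazy pass: texts only
    for sentence_id, result in zip(ids, pipe(texts)):
        # Some pipeline configurations wrap each result in a one-item list.
        if isinstance(result, list):
            result = result[0]
        yield {id_key: sentence_id, **result}


def fake_pipe(texts):
    """Hypothetical stand-in for a real pipeline, used only for the demo."""
    for text in texts:
        yield {"label": "POSITIVE" if "good" in text else "NEGATIVE", "score": 1.0}


rows = [
    {"sentence_id": 101, "text": "a good day"},
    {"sentence_id": 102, "text": "a bad day"},
]
labelled = list(classify_with_ids(rows, fake_pipe))
```

With the objects from the snippet above, the call would be classify_with_ids(ds, pipe), and each yielded dict could be written out as one row of the output file. Only the text generator is handed to the pipeline, so the streaming behaviour described in the documentation quote should be unaffected.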