KeyError: 'csv' using a csv file with KeyDataset

I am trying to use the roberta-base-openai-detector model from Hugging Face.

I get a KeyError: 'csv' when I try to use a CSV file. When I use a text file instead and change the load_dataset and KeyDataset arguments to "text", it works.

Do I need to do some preprocessing before or what am I doing wrong?

from datasets import load_dataset
from tqdm import tqdm
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset

pipe = pipeline("text-classification", model="roberta-base-openai-detector")
dataset = load_dataset('csv', data_files=["/content/training testing split with college address.xlsx - Sheet1.csv"], split='train')

for out in tqdm(pipe(KeyDataset(dataset, "csv"))):
    print(out)

The error stack doesn’t give me much information.

KeyError                                  Traceback (most recent call last)
<ipython-input-5-98a64dbb56de> in <cell line: 10>()
      8 
      9 
---> 10 for out in tqdm(pipe(KeyDataset(dataset, "csv"))):
     11     print (out)
     12 

/usr/local/lib/python3.10/dist-packages/transformers/pipelines/pt_utils.py in __getitem__(self, i)
    303 
    304     def __getitem__(self, i):
--> 305         return self.dataset[i][self.key]
    306 
    307 

KeyError: 'csv'

load_dataset accepts either the name of a dataset hosted on the Hugging Face Hub, or a path to local files in a supported format (csv, json, text, etc.), which it processes with the corresponding builder.
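For example (the dataset name and local file path below are just placeholders):

from datasets import load_dataset

# a dataset hosted on the Hugging Face Hub, referenced by name
ds_hub = load_dataset("imdb", split="train")

# a local CSV file processed by the built-in csv builder
ds_local = load_dataset("csv", data_files="my_file.csv", split="train")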

See here for more documentation

What is the name of the text column in your CSV file? You can fix this error by replacing “csv” with that column name when initializing the KeyDataset object.
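For example, if the column holding your text is called text (replace it with whatever your CSV actually uses, and your_file.csv with your actual file), the loop from your snippet becomes:

from datasets import load_dataset
from tqdm import tqdm
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset

pipe = pipeline("text-classification", model="roberta-base-openai-detector")
dataset = load_dataset("csv", data_files=["your_file.csv"], split="train")

# pass the column name, not the builder name "csv"
for out in tqdm(pipe(KeyDataset(dataset, "text"))):
    print(out)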

What if I’m passing several columns and not just one column?

You need to combine these columns into a single text column before using the pipeline.

If it has to be a single column before using the pipeline, is there a way that I can process each row of the file and get a result for that row?
i.e.

if I have a file that is:
line1
line2
line3

I want to pass just line1 first, get a result, then pass line2, get a result, and so on. Thanks for the help so far!

You can use the built-in text builder to read the file line by line:

from datasets import load_dataset
ds = load_dataset("text", data_files="path/to/csv", split="train")
ds = ds.select(range(1, len(ds))) # remove the csv header
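The text builder puts each line of the file into a column named "text", so you can then run the pipeline row by row, reusing the pipeline from your original snippet:

from tqdm import tqdm
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset

pipe = pipeline("text-classification", model="roberta-base-openai-detector")

# each row of ds is {"text": "<one line of the file>"}
for out in tqdm(pipe(KeyDataset(ds, "text"))):
    print(out)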

Another option is to use map to merge the columns.
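A minimal sketch of the map approach, assuming two hypothetical columns col_a and col_b that you want to concatenate into a single text column:

from datasets import load_dataset

ds = load_dataset("csv", data_files="your_file.csv", split="train")

# build one "text" column from the two originals, then drop them
ds = ds.map(lambda row: {"text": row["col_a"] + " " + row["col_b"]},
            remove_columns=["col_a", "col_b"])

After that, KeyDataset(ds, "text") works the same way as in the single-column example above.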