KeyError: 'csv' using a csv file with KeyDataset

I am trying to use the roberta-base-openai-detector model from Hugging Face.

I get a KeyError: 'csv' when I try to use a CSV file. When I use a text file instead and change the load_dataset and KeyDataset arguments to "text", it works.

Do I need to do some preprocessing before or what am I doing wrong?

from datasets import load_dataset
from tqdm import tqdm
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset

pipe = pipeline("text-classification", model="roberta-base-openai-detector")
dataset = load_dataset('csv', data_files=["/content/training testing split with college address.xlsx - Sheet1.csv"], split='train')

for out in tqdm(pipe(KeyDataset(dataset, "csv"))):
    print(out)

The error stack doesn’t give me much information.

KeyError                                  Traceback (most recent call last)
<ipython-input-5-98a64dbb56de> in <cell line: 10>()
      8 
      9 
---> 10 for out in tqdm(pipe(KeyDataset(dataset, "csv"))):
     11     print (out)
     12 

/usr/local/lib/python3.10/dist-packages/transformers/pipelines/pt_utils.py in __getitem__(self, i)
    303 
    304     def __getitem__(self, i):
--> 305         return self.dataset[i][self.key]
    306 
    307 

KeyError: 'csv'

load_dataset accepts either the name of a dataset hosted on the Hugging Face Hub, or a path to local files in a supported format (csv, json, text, etc.), which it processes with the corresponding builder.
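For example (the dataset name and local file path below are just placeholders):

from datasets import load_dataset

# a dataset hosted on the Hugging Face Hub, referenced by name
ds_hub = load_dataset("imdb", split="train")

# a local CSV file processed by the built-in csv builder
ds_local = load_dataset("csv", data_files="my_file.csv", split="train")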

See here for more documentation

What is the name of the text column in your CSV file? You can fix this error by replacing “csv” with that column name when initializing the KeyDataset object.
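For example, if the column holding your text is called text (replace it with whatever your CSV actually uses, and your_file.csv with your actual file), the loop from your snippet becomes:

from datasets import load_dataset
from tqdm import tqdm
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset

pipe = pipeline("text-classification", model="roberta-base-openai-detector")
dataset = load_dataset("csv", data_files=["your_file.csv"], split="train")

# pass the column name, not the builder name "csv"
for out in tqdm(pipe(KeyDataset(dataset, "text"))):
    print(out)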

What if I’m passing several columns and not just one column?

You need to combine these columns into a single text column before using the pipeline.

If it has to be a single column before using the pipeline, is there a way that I can process each row of the file and get a result for that row?
i.e.

if I have a file that is:
line1
line2
line3

I want to pass just line1 first, get a result, then pass line2, get a result, and so on. Thanks for the help so far!

You can use the built-in text builder to read the file line by line:

from datasets import load_dataset
ds = load_dataset("text", data_files="path/to/csv", split="train")
ds = ds.select(range(1, len(ds))) # remove the csv header
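The text builder puts each line of the file into a column named "text", so you can then run the pipeline row by row, reusing the pipeline from your original snippet:

from tqdm import tqdm
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset

pipe = pipeline("text-classification", model="roberta-base-openai-detector")

# each row of ds is {"text": "<one line of the file>"}
for out in tqdm(pipe(KeyDataset(ds, "text"))):
    print(out)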

Another option is to use map to merge the columns.
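A minimal sketch of the map approach, assuming two hypothetical columns col_a and col_b that you want to concatenate into a single text column:

from datasets import load_dataset

ds = load_dataset("csv", data_files="your_file.csv", split="train")

# build one "text" column from the two originals, then drop them
ds = ds.map(lambda row: {"text": row["col_a"] + " " + row["col_b"]},
            remove_columns=["col_a", "col_b"])

After that, KeyDataset(ds, "text") works the same way as in the single-column example above.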