CSV File Being Misinterpreted as Text in Hugging Face Dataset

Aananda-giri · December 27, 2024, 10:13am

link-of-csv-file-uploaded-to-hub

I have a large CSV file (~30 GB) containing a pre-tokenized Nepali dataset. It has two columns: “input_ids” and “target_ids”. Loading the CSV file locally with load_dataset works well.

That is, it returns each row in the format: {'input_ids': <str(list)>, 'target_ids': <str(list)>}
e.g. {'input_ids': '[239, 552, ...]', 'target_ids': '[552, 875, ...]'}
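
Since each column arrives as a stringified list rather than a list of integers, the values need to be parsed before they can be used as token IDs. A minimal sketch of that step, assuming the strings are plain bracketed integer lists (so they are also valid JSON); `parse_row` and the sample values are hypothetical and not taken from the dataset:

```python
import json


def parse_row(row: dict) -> dict:
    """Turn stringified token lists such as '[239, 552]' into lists of ints."""
    return {key: json.loads(value) for key, value in row.items()}


# Hypothetical row in the same shape as the locally loaded CSV rows.
row = {"input_ids": "[239, 552, 875]", "target_ids": "[552, 875, 101]"}
parsed = parse_row(row)
print(parsed["input_ids"])
```

With a `datasets` object, the same function could be passed to `.map(parse_row)`.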

However, when I load the same file after uploading it to the Hugging Face Hub, each row is returned in the following format:

{'text': 'input_ids,target_ids'}
e.g. {'text': '"[239, 552,...]" , "[552, 875, ...]"'}

The above output is the same whether or not we use streaming=True.
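
As a stopgap, rows that come back as a single 'text' field can still be split into their columns, because each one is just a raw CSV line. A rough sketch using the standard `csv` module; `split_text_row` and the sample rows are hypothetical, and this only works when no field contains a newline (which should hold for token lists):

```python
import csv


def split_text_row(text: str, header: list) -> dict:
    """Parse one raw CSV line and pair its fields with the column names."""
    fields = next(csv.reader([text]))
    return dict(zip(header, fields))


# Hypothetical rows in the shape returned when the file is read as text:
# the first row is the CSV header, the following rows are data lines.
header_row = {"text": "input_ids,target_ids"}
data_row = {"text": '"[239, 552]","[552, 875]"'}

header = next(csv.reader([header_row["text"]]))
record = split_text_row(data_row["text"], header)
print(record)
```

This does not address why the file is read as text in the first place; it only recovers the columns from what is returned.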

Code we used to upload this file to the Hugging Face Hub:

# put huggingface key in secrets as `HF_TOKEN`
from huggingface_hub import HfApi
api = HfApi()
api.upload_file(
    path_or_fileobj="tokenized_data.csv",
    path_in_repo="pre_tokenized/iriisnepal_u_nepberta_512.csv",
    repo_id="Aananda-giri/nepali_llm_datasets",
    repo_type="dataset"
)

The following configs were added to README.md:

---
configs:
 <other configs>
  - config_name: iriisnepal_u_nepberta_512
    data_files:
      - split: train
        path:
          - pre_tokenized/iriisnepal_u_nepberta_512.csv
---

And the code used to access this file:

# without streaming=True, load_dataset downloads the entire dataset first
from datasets import load_dataset

# use streaming=True to avoid downloading entire dataset
data = load_dataset("Aananda-giri/nepali_llm_datasets", name="iriisnepal_u_nepberta_512", streaming=True)

for n,d in enumerate(data['train']):
  print(d)
  if n >= 2:
    break

'''
## output
{'text': 'input_ids,target_ids'}
{'text': '"[239, 552, ...
'''

datasets version = 3.2.0

mahmutc · December 27, 2024, 7:51pm

Silly idea but have you tried loading just a few lines from the file?

from datasets import load_dataset_builder
ds_builder = load_dataset_builder("Aananda-giri/nepali_llm_datasets", name="iriisnepal_u_nepberta_512")
ds_builder

returns

<datasets.packaged_modules.text.text.TextNepaliLlmDatasets at 0x7ff1465290c0>

but it should be
<datasets.packaged_modules.csv.csv.TextNepaliLlmDatasets at 0x7ff1465290c0>
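
Since the builder is being resolved to the text module, one thing that might be worth trying is to bypass the repo-level inference and request the csv builder explicitly, pointing `data_files` at the file through an `hf://` path. The thread does not confirm that this resolves the issue, so treat it as a sketch; `hub_csv_url` and `load_hub_csv` are hypothetical helper names, while the repo ID and file path are the ones from the post above:

```python
def hub_csv_url(repo_id: str, path_in_repo: str) -> str:
    """Build an hf:// URL for a file stored in a dataset repo on the Hub."""
    return f"hf://datasets/{repo_id}/{path_in_repo}"


def load_hub_csv(repo_id: str, path_in_repo: str, streaming: bool = True):
    """Load a single Hub-hosted CSV file with the csv builder named explicitly."""
    # Imported here so the URL helper above can be used without `datasets` installed.
    from datasets import load_dataset

    url = hub_csv_url(repo_id, path_in_repo)
    return load_dataset("csv", data_files={"train": url}, streaming=streaming)


# Example call (needs network access and the `datasets` package):
# data = load_hub_csv(
#     "Aananda-giri/nepali_llm_datasets",
#     "pre_tokenized/iriisnepal_u_nepberta_512.csv",
# )
# print(next(iter(data["train"])))
```

If this returns rows with 'input_ids' and 'target_ids' keys, it would suggest the file itself is fine and the problem lies in how the builder is chosen for the repo.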
