CSV File Being Misinterpreted as Text in Hugging Face Dataset

I have a large CSV file (~30 GB) containing a pre-tokenized Nepali dataset. It has two columns: “input_ids” and “target_ids”. Loading the CSV file locally with load_dataset works well,

i.e. it returns each row in the format: {'input_ids': <str(list)>, 'target_ids': <str(list)>}
e.g. {'input_ids': '[239, 552, ...]', 'target_ids': '[552, 875, ...]'}
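
Since each cell comes back as the string form of a list, it has to be parsed into integers before it can be used as token IDs. Here is a minimal sketch of that step, assuming every cell is a well-formed bracketed list of integers (which also makes it valid JSON). parse_ids is a hypothetical helper name and the sample row is made up:

```python
import json


def parse_ids(cell: str) -> list[int]:
    """Turn a stringified list such as '[239, 552]' into a list of ints."""
    return [int(x) for x in json.loads(cell)]


# made-up sample row in the same shape load_dataset returns locally
row = {"input_ids": "[239, 552]", "target_ids": "[552, 875]"}
parsed = {name: parse_ids(value) for name, value in row.items()}
print(parsed)
```

With a datasets object the same helper could be applied row by row through map, but that is outside what was tested here.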

However, when I load the same file after uploading it to the Hugging Face Hub, each row is returned in the following format:

{'text': 'input_ids,target_ids'}
e.g. {'text': '"[239, 552,...]" , "[552, 875, ...]"'}

The output above is the same with or without streaming=True.
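
As a stopgap (not a fix for the builder selection), the raw lines can be turned back into columns on the client side. This is only a sketch: it assumes the first record is the header line, that each record holds one complete CSV row, and that every cell is a JSON-compatible list. rows_from_text is a hypothetical helper and the sample records are made up:

```python
import csv
import json


def rows_from_text(records):
    """Rebuild column dicts from {'text': <raw CSV line>} records.

    Assumes the first record holds the header and that no CSV field
    spans multiple lines (each record is one complete CSV row).
    """
    lines = (rec["text"] for rec in records)
    reader = csv.reader(lines, skipinitialspace=True)
    header = next(reader)
    for fields in reader:
        # each field is a stringified list, so decode it into real integers
        yield {name: json.loads(value) for name, value in zip(header, fields)}


# made-up records in the shape the Hub-loaded dataset returns
sample = [
    {"text": "input_ids,target_ids"},
    {"text": '"[239, 552]","[552, 875]"'},
]
recovered = list(rows_from_text(sample))
print(recovered)
```

With the streaming dataset, data['train'] could be passed in place of sample, since the helper only needs an iterable of dicts with a 'text' key.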

Code we used to upload this file to the Hugging Face Hub:

# put huggingface key in secrets as `HF_TOKEN`
from huggingface_hub import HfApi
api = HfApi()
api.upload_file(
    path_or_fileobj="tokenized_data.csv",
    path_in_repo="pre_tokenized/iriisnepal_u_nepberta_512.csv",
    repo_id="Aananda-giri/nepali_llm_datasets",
    repo_type="dataset"
)

and the following configs were added to README.md:

---
configs:
  # <other configs>
  - config_name: iriisnepal_u_nepberta_512
    data_files:
      - split: train
        path:
          - pre_tokenized/iriisnepal_u_nepberta_512.csv
---

and the code used to access this file:

from datasets import load_dataset

# streaming=True iterates over the data without downloading the entire dataset first
data = load_dataset("Aananda-giri/nepali_llm_datasets", name="iriisnepal_u_nepberta_512", streaming=True)

# print the first three rows
for n, d in enumerate(data['train']):
    print(d)
    if n >= 2:
        break

'''
## output
{'text': 'input_ids,target_ids'}
{'text': '"[239, 552, ...
'''
datasets version: 3.2.0

Silly idea but have you tried loading just a few lines from the file?

from datasets import load_dataset_builder
ds_builder = load_dataset_builder("Aananda-giri/nepali_llm_datasets", name="iriisnepal_u_nepberta_512")
ds_builder

returns

<datasets.packaged_modules.text.text.TextNepaliLlmDatasets at 0x7ff1465290c0>

but it should be the csv builder, i.e. something like
<datasets.packaged_modules.csv.csv.CsvNepaliLlmDatasets at 0x...>
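
One thing that might be worth trying is to bypass the repo config and name the csv builder explicitly, pointing data_files at the file through an hf:// path. This is a sketch only and has not been run against this repo. It assumes the installed datasets version resolves hf:// paths in data_files. hub_csv_url and load_hub_csv are hypothetical helper names:

```python
def hub_csv_url(repo_id: str, path_in_repo: str) -> str:
    """Build an hf:// URL for a file inside a dataset repo."""
    return f"hf://datasets/{repo_id}/{path_in_repo}"


def load_hub_csv(repo_id: str, path_in_repo: str, streaming: bool = True):
    """Load one Hub-hosted file with the csv builder (needs network access)."""
    # imported lazily so the URL helper can be used without datasets installed
    from datasets import load_dataset

    return load_dataset(
        "csv",  # name the builder explicitly instead of relying on inference
        data_files={"train": hub_csv_url(repo_id, path_in_repo)},
        streaming=streaming,
    )


url = hub_csv_url(
    "Aananda-giri/nepali_llm_datasets",
    "pre_tokenized/iriisnepal_u_nepberta_512.csv",
)
print(url)
```

Calling load_hub_csv with the same two arguments and iterating over the 'train' split should show whether the rows come back with input_ids and target_ids columns. If they do, the file itself is fine and the problem is in how the builder gets picked for the repo config.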
