I have a large CSV file (~30 GB) containing a pre-tokenized Nepali dataset. It has two columns, "input_ids" and "target_ids". Loading the CSV file locally with load_dataset works fine,
i.e. it returns each row in the format: {'input_ids': <str(list)>, 'target_ids': <str(list)>}
e.g. {'input_ids': '[239, 552, ...]', 'target_ids': '[552, 875, ...]'}
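For completeness, this is how we turn those stringified lists back into lists of ints after loading (a minimal sketch; decode_row is just our own helper, not part of the pipeline):

import ast

# helper: convert the stringified lists back into lists of ints
def decode_row(row):
    return {
        "input_ids": ast.literal_eval(row["input_ids"]),
        "target_ids": ast.literal_eval(row["target_ids"]),
    }

# usage: dataset = dataset.map(decode_row)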
However, after uploading the same file to the Hugging Face Hub, loading it returns each row in the following format instead:
{'text': 'input_ids,target_ids'}
e.g. {'text': '"[239, 552,...]" , "[552, 875, ...]"'}
The output above is the same whether or not we use streaming=True.
The code we used to upload this file to the Hugging Face Hub:
# put the Hugging Face token in secrets as `HF_TOKEN`
from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="tokenized_data.csv",
    path_in_repo="pre_tokenized/iriisnepal_u_nepberta_512.csv",
    repo_id="Aananda-giri/nepali_llm_datasets",
    repo_type="dataset",
)
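One alternative we are considering (an untested sketch; it assumes the local CSV parse stays correct at this size): let datasets write the data itself with push_to_hub, which uploads Parquet shards and should preserve the two-column schema instead of relying on CSV handling on the Hub side:

from datasets import load_dataset

# untested alternative: parse the csv locally (where it already works),
# then push parquet shards to the hub instead of the raw csv file
ds = load_dataset("csv", data_files="tokenized_data.csv")
ds.push_to_hub("Aananda-giri/nepali_llm_datasets", config_name="iriisnepal_u_nepberta_512")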
And the following config was added to README.md:
---
configs:
  <other configs>
  - config_name: iriisnepal_u_nepberta_512
    data_files:
      - split: train
        path:
          - pre_tokenized/iriisnepal_u_nepberta_512.csv
---
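As a sanity check (unverified; it assumes data_files accepts hf:// paths in this datasets version), the csv builder could be pointed directly at the uploaded file, bypassing the README config entirely:

from datasets import load_dataset

# sanity check (unverified): force the csv builder, skipping config inference
data = load_dataset(
    "csv",
    data_files="hf://datasets/Aananda-giri/nepali_llm_datasets/pre_tokenized/iriisnepal_u_nepberta_512.csv",
    streaming=True,
)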
And the code used to access this file:
from datasets import load_dataset

# streaming=True avoids downloading the entire dataset first
data = load_dataset("Aananda-giri/nepali_llm_datasets", name="iriisnepal_u_nepberta_512", streaming=True)

# print the first few rows
for n, d in enumerate(data['train']):
    print(d)
    if n >= 2:
        break
'''
## output
{'text': 'input_ids,target_ids'}
{'text': '"[239, 552, ...
'''
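As a stopgap, each streamed 'text' row can be re-parsed as a single CSV record (a rough sketch; it works around the symptom rather than fixing whatever is selecting the text builder):

import ast
import csv

# stopgap sketch: split each 'text' row back into the two columns
for n, d in enumerate(data['train']):
    if n == 0:
        continue  # skip the 'input_ids,target_ids' header row
    fields = next(csv.reader([d['text']], skipinitialspace=True))
    input_ids = ast.literal_eval(fields[0])
    target_ids = ast.literal_eval(fields[1])
    if n >= 3:
        break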
- datasets version: 3.2.0