I have a large dataset in WebDataset format, and one of the columns is a list of strings.
I see here that the dataset viewer doesn't seem capable of handling the list[str] dtype at the moment, but I am wondering whether the parquet-converter bot will change the dtype to one that works when it converts the WebDataset files to Parquet files.
Edit: This repo seems to show that list dtypes work? HuggingFaceH4/OpenHermes-2.5-1k-longest · Datasets at Hugging Face
Do you have a dataset to share, so that we can investigate?
My dataset is available here: ProGamerGov/synthetic-dataset-1m-high-quality-captions · Datasets at Hugging Face
I set the YAML metadata to match what was done here: davanstrien/dataset-tldr-preference-dpo · Datasets at Hugging Face, using sequence: string instead of dtype: string, but I'm not sure if that's correct, as I haven't been able to find much documentation on how to do this.
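To make the sequence: string versus dtype: string distinction concrete, here is a minimal sketch that prints the kind of dataset_info block being discussed. The column names caption and tags are placeholders rather than the real columns of either dataset, and render_features_yaml is a purely illustrative helper, not part of any Hugging Face library. Whether the viewer accepts this layout for WebDataset files is exactly the open question, so treat it as an assumption.

```python
# Hypothetical columns: (name, element dtype, is_list).
# Only the `dtype:` vs `sequence:` distinction comes from the thread.
COLUMNS = [
    ("caption", "string", False),  # plain string column
    ("tags", "string", True),      # list-of-strings column
]


def render_features_yaml(columns):
    """Render a dataset_info features block for a dataset card README."""
    lines = ["dataset_info:", "  features:"]
    for name, dtype, is_list in columns:
        lines.append(f"  - name: {name}")
        # A list column is declared with `sequence:` instead of `dtype:`.
        key = "sequence" if is_list else "dtype"
        lines.append(f"    {key}: {dtype}")
    return "\n".join(lines)


FEATURES_YAML = render_features_yaml(COLUMNS)
print(FEATURES_YAML)
```

The printed text would go between the --- markers at the top of the dataset's README.md, next to any other card metadata.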
Have you looked at Data files Configuration? It may help with configuring the YAML.
Anyway, maybe @lhoestq can give you a hand.
So I've looked at the configuration and I have been unable to figure out the issue.