Dataset Configs with Different Columns

zouharvi · March 4, 2024, 1:31pm

I’m trying to publish a dataset that has two parts (ort_human and ort_wmt). It is split across two JSON files, and the columns in the two tables are different. Currently they are specified in the dataset card (README.md) as:

configs:
- config_name: ort_human
  data_files: ort_human.json
- config_name: ort_wmt
  data_files: ort_wmt.json

When downloading it with load_dataset("zouharvi/optimal-reference-translations", "ort_human"), I get the error ValueError: Couldn't cast .... to .... because column names don't match. My understanding is that the library is trying to join them together.
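For reference, here is a minimal sketch of how each part would be loaded on its own by passing the config name as the second argument. The repository ID and config names are the ones from this post; the `load_part` helper is a hypothetical name used only for illustration, and the sketch assumes a recent `datasets` release is installed.

```python
REPO_ID = "zouharvi/optimal-reference-translations"
CONFIG_NAMES = ["ort_human", "ort_wmt"]  # the two configs declared in README.md


def load_part(config_name):
    """Load a single config of the dataset (hypothetical helper)."""
    if config_name not in CONFIG_NAMES:
        raise ValueError(f"Unknown config {config_name!r}, expected one of {CONFIG_NAMES}")
    # Imported lazily so the module can be imported without `datasets` installed.
    from datasets import load_dataset

    # Each config should be built only from its own data_files entry,
    # so the differing columns of the other part should not matter.
    return load_dataset(REPO_ID, config_name)
```

Calling `load_part("ort_human")` and `load_part("ort_wmt")` should then return two independent datasets, each with its own columns.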

What is the intended way of having two parts in the same dataset? They are closely related, and it doesn’t make sense for them to be two repositories. Currently I just bypass this check by using streaming=True and converting the result to a list (this is super ugly and actually a bug). Also, custom loading scripts will be semi-deprecated in the near future.
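The workaround mentioned above looks roughly like the sketch below. The helper name `load_part_streaming` is hypothetical, and the split name "train" is an assumption (it is the default split when data_files is given as a single file without an explicit split).

```python
REPO_ID = "zouharvi/optimal-reference-translations"


def load_part_streaming(config_name, split="train"):
    """Stream one config and materialize it as a list of dicts (hypothetical helper)."""
    # Imported lazily so the module can be imported without `datasets` installed.
    from datasets import load_dataset

    # streaming=True returns iterable datasets that read records lazily,
    # which is how the column check was bypassed here.
    stream = load_dataset(REPO_ID, config_name, streaming=True)
    # Iterating over the chosen split yields one dict per example.
    return list(stream[split])
```

This pulls the whole part into memory, so it is only reasonable for small datasets.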

Apologies if this has been answered here before. It seems like a common thing to want to do, though I was unable to find the relevant issue.

zouharvi · March 4, 2024, 1:58pm

On another machine I didn’t get this error. After updating the datasets library, it started working.
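A quick way to compare machines is to check which version of the library is installed on each. The sketch below uses only the standard library; the helper name `installed_datasets_version` is hypothetical, and this thread does not establish which release contains the fix.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_datasets_version():
    """Return the installed `datasets` version string, or None if it is not installed."""
    try:
        return version("datasets")
    except PackageNotFoundError:
        return None


print("datasets version:", installed_datasets_version())
```

If the version is older than on the machine where loading works, upgrading with `pip install --upgrade datasets` is the first thing to try.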

system · March 5, 2024, 1:59am

This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.
