Dataset Configs with Different Columns

I’m trying to publish a dataset which has two parts (ort_human and ort_wmt). It’s split across two JSON files, and the columns in each file are different. Currently they are specified in the dataset card (README.md) as:

configs:
- config_name: ort_human
  data_files: ort_human.json
- config_name: ort_wmt
  data_files: ort_wmt.json

When loading it with load_dataset("zouharvi/optimal-reference-translations", "ort_human"), I get the error ValueError: Couldn't cast .... to .... because column names don't match. My understanding is that the library is trying to combine both files into a single table, even though I only asked for one config.
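For anyone hitting the same error, here is a minimal sketch for confirming that the two files really do have different columns. It uses only the Python standard library, not the datasets API. The helper name column_names and the example columns (source, rating, system) are made up for illustration; the real files' columns are not shown in this post.

```python
import json
import tempfile
from pathlib import Path


def column_names(path):
    """Return the set of keys seen across all records in a JSON or JSON Lines file."""
    text = Path(path).read_text(encoding="utf-8")
    try:
        # Whole-file JSON: either a list of records or a single record.
        data = json.loads(text)
        records = data if isinstance(data, list) else [data]
    except json.JSONDecodeError:
        # Fall back to JSON Lines: one JSON object per non-empty line.
        records = [json.loads(line) for line in text.splitlines() if line.strip()]
    columns = set()
    for record in records:
        if isinstance(record, dict):
            columns.update(record.keys())
    return columns


# Hypothetical stand-ins for ort_human.json and ort_wmt.json.
with tempfile.TemporaryDirectory() as tmp:
    human = Path(tmp) / "ort_human.json"
    wmt = Path(tmp) / "ort_wmt.json"
    human.write_text(json.dumps([{"source": "a", "rating": 5}]), encoding="utf-8")
    wmt.write_text(json.dumps([{"source": "a", "system": "x"}]), encoding="utf-8")
    human_cols = column_names(human)
    wmt_cols = column_names(wmt)

# Columns that appear in only one of the two files.
only_human = sorted(human_cols - wmt_cols)
only_wmt = sorted(wmt_cols - human_cols)
print("only in ort_human:", only_human)
print("only in ort_wmt:", only_wmt)
```

If both lists are non-empty, the two files cannot share one schema, which matches the "column names don't match" message.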

What is the intended way to have two parts in the same dataset? They are closely related, and it doesn’t make sense to split them into two repositories. Currently I just bypass this check by using streaming=True and converting the result to a list (this is super ugly and actually a bug). Custom loading scripts are not a good alternative either, since they will be semi-deprecated in the near future.

Apologies if this has been answered here before. It seems like a common thing to want to do, though I was unable to find the relevant issue.

On another machine I didn’t get this error. After updating the datasets library, it started working.
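Since the fix was a library update, a quick way to compare machines is to print the installed version on each. This is a small sketch using the standard library's importlib.metadata; the helper name installed_version is my own, and I don't know which datasets release introduced the fix.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


datasets_version = installed_version("datasets")
print("datasets:", datasets_version or "not installed")
```

If the two machines report different versions, upgrading the older one (for example with pip install -U datasets) is worth trying before changing the dataset card.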
