I’m trying to publish a dataset which has two parts (ort_human and ort_wmt). It’s split across two JSON files, and the columns in each table are different. Currently they are specified in the dataset card (README.md) as:
configs:
- config_name: ort_human
  data_files: ort_human.json
- config_name: ort_wmt
  data_files: ort_wmt.json
When downloading it with load_dataset("zouharvi/optimal-reference-translations", "ort_human"), I get the error ValueError: Couldn't cast .... to .... because column names don't match. My understanding is that the library is trying to join them together.
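To illustrate what I mean by different columns, here is a minimal sketch using only the standard library. The records and column names below are made up for illustration (they are not the real contents of ort_human.json or ort_wmt.json), and column_names is a hypothetical helper, not part of any library.

```python
import json


def column_names(records):
    """Return the set of keys that appear across a list of JSON records."""
    names = set()
    for record in records:
        names.update(record.keys())
    return names


# Hypothetical stand-ins for the two files; the real columns are different.
ort_human_sample = json.loads('[{"source": "a", "rating": 5}]')
ort_wmt_sample = json.loads('[{"source": "a", "system": "x"}]')

human_cols = column_names(ort_human_sample)
wmt_cols = column_names(ort_wmt_sample)

# Columns present in one table but not the other.
only_human = human_cols - wmt_cols
only_wmt = wmt_cols - human_cols
schemas_match = human_cols == wmt_cols

print("only in ort_human:", sorted(only_human))
print("only in ort_wmt:", sorted(only_wmt))
print("schemas match:", schemas_match)
```

If both tables were forced into a single schema, the columns listed as belonging to only one side would have no counterpart, which is consistent with the column-name mismatch reported in the error.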
What is the intended way of having two parts in the same dataset? They are closely related, and it doesn’t make sense for them to be two repositories. Currently I just bypass this check by using streaming=True and casting to list (this is super ugly and actually a bug). Also, custom loading scripts will be semi-deprecated in the near future.
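For reference, the workaround looks roughly like the sketch below. The function name load_config_as_list is hypothetical, and the "train" split name is an assumption (I believe it is what the library assigns when data_files has no named splits). It needs the datasets package and network access to run.

```python
def load_config_as_list(repo_id, config_name, split="train"):
    """Stream one config of a Hub dataset and materialize it as a list of dicts.

    Workaround sketch only: streaming=True is used to sidestep the cast
    error, and the whole split is then pulled into memory.
    """
    # Imported here so the sketch can be read without datasets installed.
    from datasets import load_dataset

    streamed = load_dataset(repo_id, config_name, split=split, streaming=True)
    # Iterating the streamed dataset yields one dict per row.
    return list(streamed)


# Example call (requires network access):
# rows = load_config_as_list("zouharvi/optimal-reference-translations", "ort_human")
```

This loses everything a regular Dataset object provides (caching, column access, map/filter), which is why I would rather not rely on it.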
Apologies if this has been answered here before. It seems like a common thing to want to do, though I was unable to find the relevant issue.