Hi,
If I understand you correctly, you are trying to create a dataset from two CSV files in one read, e.g.:
dset = load_dataset("csv", data_files=["file1.csv", "file2.csv"])
which errors out because these files don’t share the same set of fields.
If that’s the case, you have two options:
- create a separate dataset from each file, add None values where needed (either with map to have them cached in a file or with add_column to have them in RAM), and concatenate the datasets (make sure the datasets have matching schemas before the concatenate_datasets call - if not, use cast)
- add a dataset loading script to the Hub (the preferred way if you want to load the dataset directly from the Hub without additional processing described in Option 1). There you can easily assign None to missing values. Or, to avoid missing data, you can have a separate config for each CSV file. More info on writing a dataset loading script can be found here, and an example of the script which reads CSV data here.