Explicitly defining schema in a dataset?

Hi,

If I understand you correctly, you are trying to create a dataset from two CSV files in one read, e.g.:

from datasets import load_dataset

dset = load_dataset("csv", data_files=["file1.csv", "file2.csv"])

which errors out because these files don’t share the same set of fields.

If that’s the case, you have two options:

  1. create a separate dataset from each file, add None values for the columns each one is missing (either with map, which caches the result in a file, or with add_column, which keeps it in RAM), and concatenate the datasets with concatenate_datasets. Make sure the datasets have matching schemas before that call; if they don't, use cast.
  2. add a dataset loading script to the Hub (the preferred way if you want to load the dataset directly from the Hub without additional processing described in Option 1). There you can easily assign None to missing values. Or, to avoid missing data, you can have a separate config for each CSV file. More info on writing a dataset loading script can be found here, and an example of the script which reads CSV data here.