Explicitly defining schema in a dataset?


We are trying to create a dataset from two main kinds of tabular data. Both are stored in CSV files and share most of their fields, albeit with a few important differences. The Datasets library fails to read them together in any way: the files are neither concatenated into one dataset with the missing fields set to null, nor split into two separate entities in the object - either behavior would be perfectly acceptable for our goals. Is there any configuration work we can do to avoid rebuilding large datasets like this from scratch?


If I understand you correctly, you are trying to create a dataset from two CSV files in one read, e.g.:

from datasets import load_dataset

dset = load_dataset("csv", data_files=["file1.csv", "file2.csv"])

which errors out because these files don’t share the same set of fields.

If that’s the case, you have two options:

  1. create a separate dataset from each file, add None values where needed (either with map to have them cached in a file, or with add_column to keep them in RAM), and concatenate the datasets (make sure the datasets have matching schemas before the concatenate_datasets call - if not, use cast)
  2. add a dataset loading script to the Hub (the preferred way if you want to load the dataset directly from the Hub without the additional processing described in Option 1). There you can easily assign None to missing values. Or, to avoid missing data altogether, you can define a separate config for each CSV file. More info on writing a dataset loading script can be found here, and an example of a script that reads CSV data here.