Explicitly defining schema in a dataset?

mariosasko · November 3, 2021, 10:06pm

Hi,

If I understand you correctly, you are trying to create a dataset from two CSV files in one read, e.g.:

from datasets import load_dataset

dset = load_dataset("csv", data_files=["file1.csv", "file2.csv"])

which errors out because these files don’t share the same set of fields.

If that’s the case, you have two options:

1. Create a separate dataset from each file, add None values where needed (either with map, which caches them in a file, or with add_column, which keeps them in RAM), and concatenate the datasets. Make sure the datasets have matching schemas before the concatenate_datasets call; if they don't, use cast.
2. Add a dataset loading script to the Hub (the preferred way if you want to load the dataset directly from the Hub without the additional processing described in option 1). There you can easily assign None to missing values. Or, to avoid missing data, you can have a separate config for each CSV file. More info on writing a dataset loading script can be found here, and an example of a script that reads CSV data here.
