Explicitly defining schema in a dataset?


We are trying to create a dataset from two main kinds of tabular data. Both are stored in CSV files and share most of their fields, albeit with a few important differences. The Datasets library fails to read them together in any way: the files are neither concatenated into one dataset with the missing fields set to null, nor split into two separate entities in the object - either behavior would be perfectly acceptable for our goals. Is there any configuration work we can do to avoid rebuilding large datasets like this from scratch?


If I understand you correctly, you are trying to create a dataset from two CSV files in one read, e.g.:

from datasets import load_dataset

dset = load_dataset("csv", data_files=["file1.csv", "file2.csv"])

which errors out because these files don’t share the same set of fields.

If that’s the case, you have two options:

  1. create a separate dataset from each file, add None values where needed (either with map to have them cached in a file, or with add_column to keep them in RAM), and concatenate the datasets (make sure the datasets have matching schemas before the concatenate_datasets call - if not, use cast)
  2. add a dataset loading script to the Hub (the preferred way if you want to load the dataset directly from the Hub without the additional processing described in Option 1). There you can easily assign None to missing values. Or, to avoid missing data altogether, you can define a separate config for each CSV file. More info on writing a dataset loading script can be found here, and an example of a script that reads CSV data here.