I want to validate my dataset that contains features of type `str`, `list(str)`, and `list(list(int))` (for NER).
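To make the shape of the data concrete, here is a hypothetical example row with those three feature types (the column names `words` and `bboxes` match the columns I mention below; the `id` field and all values are just an illustration):

```python
# Hypothetical example row; values are made up for illustration.
example = {
    "id": "doc-0001",                                  # str
    "words": ["John", "lives", "in", "Berlin"],        # list(str)
    "bboxes": [[10, 12, 40, 28], [45, 12, 80, 28],
               [85, 12, 99, 28], [104, 12, 150, 28]],  # list(list(int))
}
```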
Unfortunately, the Data-Measurements-Tool does not yet support cases with multiple label columns (see Handle the case where there are multiple label columns · Issue #36 · huggingface/data-measurements-tool · GitHub).
While I did not find information on whether lists and multiple label columns are supported by https://greatexpectations.io/, it seems TensorFlow Data Validation (tfdv) can handle them.
Unfortunately, tfdv does not yet support generating statistics directly from Arrow files (generate_statistics_from_pyarrow table or parquet · Issue #92 · tensorflow/data-validation · GitHub), so I need to go through CSV, a Pandas DataFrame, or TFRecord. I have difficulties with all three variants:
- CSV dumps a `list(list(int))` as the string `array([...]) array([...]) ...`, which tfdv does not understand.
- `export()` to a TFRecord crashes due to some type or serialization error: Export own dataset with different feature types to TFRecord.
- `to_pandas()` creates a DataFrame whose columns are series of objects, for which `tfdv.generate_statistics_from_dataframe(dataframe)` crashes with `pyarrow.lib.ArrowTypeError: ("Expected bytes, got a 'numpy.ndarray' object", 'Conversion failed for column words with type object')`. Since `dataframe["words"] = dataframe["words"].astype(str)` avoids this error for the column `words`, I guess I need to cast the DataFrame columns. But how do I do this for lists (i.e. for a Pandas Series of Pandas Series)? Is there some way that `to_pandas()` creates a correctly typed DataFrame? Calling `dataset = dataset.cast_column("bboxes", Sequence(feature=Sequence(feature=Value("int64"))))` before `to_pandas()` yields the same DataFrame.
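The only workaround I can think of so far (not a confirmed fix) is to serialize the nested-list columns to JSON strings before handing the DataFrame to tfdv, so that pyarrow sees plain string columns instead of object arrays. `stringify_nested` is a hypothetical helper of mine, and the toy DataFrame below merely stands in for the output of `to_pandas()`:

```python
import json

import pandas as pd

# Toy stand-in for dataset.to_pandas(): object-dtype columns holding lists.
df = pd.DataFrame({
    "words": [["John", "Smith"], ["Berlin"]],
    "bboxes": [[[1, 2, 3, 4], [5, 6, 7, 8]], [[9, 10, 11, 12]]],
})

def stringify_nested(frame, columns):
    """Serialize nested-list columns to JSON strings so that pyarrow
    converts them as plain strings instead of failing on object arrays.
    (If to_pandas() yields numpy arrays, call .tolist() before dumping.)"""
    out = frame.copy()
    for col in columns:
        out[col] = out[col].apply(json.dumps)
    return out

flat = stringify_nested(df, ["words", "bboxes"])
# flat now has string columns, so tfdv.generate_statistics_from_dataframe(flat)
# should no longer hit the ArrowTypeError -- but the statistics would then be
# computed over opaque strings, not over the individual tokens or integers.
```

This sidesteps the conversion error rather than solving the typing problem, which is why I am asking whether `to_pandas()` can produce correctly typed columns in the first place.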
More generally: has anyone done data validation on a NER dataset? Using which validation framework (e.g. Data-Measurements-Tool, tfdv, Great Expectations)?