Perform data validation (e.g. with TensorFlow Data Validation) on a 🤗 Dataset with list features

I want to validate my 🤗 dataset, which contains features of type str, list(str), and list(list(int)) (for NER).

Unfortunately, the 🤗 Data-Measurements-Tool does not yet support cases with multiple label columns (see Handle the case where there are multiple label columns · Issue #36 · huggingface/data-measurements-tool · GitHub).

I could not find information on whether lists and multiple label columns are supported by https://greatexpectations.io/, but it seems TensorFlow Data Validation (tfdv) can handle them.

Unfortunately, tfdv does not yet support generating statistics directly from Arrow files (generate_statistics_from_pyarrow table or parquet · Issue #92 · tensorflow/data-validation · GitHub), so I need to go through CSV, a Pandas DataFrame, or TFRecord. I have difficulties with all three variants:

  • CSV dumps a list(list(int)) column as the string array([...]) array([...]) ...., which tfdv does not understand
  • export() to a TFRecord crashes with a type or serialization error (see my other post: Export own dataset with different feature types to TFRecord)
  • to_pandas() creates a DataFrame whose columns are object-dtype Series, for which tfdv.generate_statistics_from_dataframe(dataframe) crashes with pyarrow.lib.ArrowTypeError: ("Expected bytes, got a 'numpy.ndarray' object", 'Conversion failed for column words with type object'). Since dataframe["words"] = dataframe["words"].astype(str) makes this error go away for the column words, I guess I need to cast the DataFrame columns. But how do I do this for lists (i.e. for Pandas Series of Pandas Series)? Is there some way to make to_pandas() produce a correctly typed DataFrame? Calling dataset = dataset.cast_column("bboxes", Sequence(feature=Sequence(feature=Value("int64")))) before to_pandas() yields the same DataFrame.

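One workaround I'm considering for the to_pandas() case is to serialize the nested columns to JSON strings before handing the DataFrame to tfdv, so pyarrow only ever sees plain strings. A minimal sketch (the column names words and bboxes come from my dataset; the toy data and the column-detection heuristic are made up):

```python
import json
import pandas as pd

# Toy frame mimicking my dataset's structure: str, list(str), list(list(int)).
df = pd.DataFrame({
    "id": ["a", "b"],
    "words": [["Hello", "world"], ["Foo"]],
    "bboxes": [[[0, 1, 2, 3], [4, 5, 6, 7]], [[8, 9, 10, 11]]],
})

# Serialize every column whose cells are lists/tuples to JSON strings,
# so pyarrow no longer encounters raw lists or numpy arrays in the cells.
for col in df.columns:
    if df[col].dtype == object and df[col].map(
        lambda v: isinstance(v, (list, tuple))
    ).all():
        df[col] = df[col].map(json.dumps)

# Every column is now plain strings, which should be ingestible:
# import tensorflow_data_validation as tfdv
# stats = tfdv.generate_statistics_from_dataframe(df)
```

The obvious downside is that tfdv then only computes string statistics (unique values, lengths) for the nested columns, so any per-token or per-bbox distributional information is lost.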
More generally: has anyone done data validation on a NER 🤗 dataset? If so, with which validation framework (🤗 Data-Measurements-Tool, tfdv, Great Expectations, ...)?
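For completeness, the other approach I'd try is flattening the dataset to one row per token with DataFrame.explode, so the columns become scalar and tfdv can compute real statistics. A sketch with toy data (multi-column explode needs pandas >= 1.3; the column name ner_tags is an assumption for a typical NER setup):

```python
import pandas as pd

# Toy NER-style frame: one row per sentence, parallel per-token list columns.
df = pd.DataFrame({
    "id": ["a", "b"],
    "words": [["Hello", "world"], ["Foo"]],
    "ner_tags": [[0, 1], [2]],
})

# Explode the parallel list columns together: one row per token.
flat = df.explode(["words", "ner_tags"], ignore_index=True)

# flat now has scalar columns (str / int-like), which
# tfdv.generate_statistics_from_dataframe should accept directly.
```

A list(list(int)) column like bboxes would still be nested after one explode; it would need a second pass, e.g. splitting each bbox into x0/y0/x1/y1 columns.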