Perform data validation (e.g. with TensorFlow Data Validation) on a 🤗 Dataset with list features

I want to validate my 🤗 dataset, which contains features of type str, list(str), and list(list(int)) (for NER).

Unfortunately, the 🤗 Data-Measurements-Tool does not yet support cases with multiple label columns (see Handle the case where there are multiple label columns · Issue #36 · huggingface/data-measurements-tool · GitHub).

While I could not find explicit information on whether lists and multiple label columns are supported, it seems that TensorFlow Data Validation (tfdv) can handle them.

Unfortunately, tfdv does not yet support generating statistics directly from Arrow files (generate_statistics_from_pyarrow table or parquet · Issue #92 · tensorflow/data-validation · GitHub), so I need to go through CSV, a Pandas DataFrame, or TFRecord. I ran into difficulties with all three variants:

  • CSV dumps a list(list(int)) column as the string array([...]) array([...]) ...., which tfdv does not understand
  • export() to a TFRecord crashes with a type or serialization error: Export own dataset with different feature types to TFRecord
  • to_pandas() creates a DataFrame whose columns are Series of objects, on which tfdv.generate_statistics_from_dataframe(dataframe) crashes with pyarrow.lib.ArrowTypeError: ("Expected bytes, got a 'numpy.ndarray' object", 'Conversion failed for column words with type object'). Since dataframe["words"] = dataframe["words"].astype(str) avoids this error for the words column, I guess I need to cast the DataFrame columns. But how do I do that for lists (i.e. for Pandas Series of Series)? Is there a way to make to_pandas() produce a correctly typed DataFrame? Calling dataset = dataset.cast_column("bboxes", Sequence(feature=Sequence(feature=Value("int64")))) before to_pandas() yields the same DataFrame.
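One workaround I would try for the DataFrame route (a sketch, not verified against tfdv; the column names are illustrative): explode the aligned list columns so that each token/tag pair becomes its own row with plain scalar dtypes, which tfdv's DataFrame statistics generator should accept. Multi-column explode requires pandas >= 1.3.

```python
import pandas as pd

# Toy NER-style frame: "words" is list(str), "ner_tags" is list(int),
# mimicking what dataset.to_pandas() would produce for these features.
df = pd.DataFrame({
    "id": [0, 1],
    "words": [["Hello", "world"], ["NER"]],
    "ner_tags": [[0, 1], [2]],
})

# Explode the aligned list columns together (pandas >= 1.3), so every
# token/tag pair becomes one row. The exploded columns come back with
# object dtype, so cast the tag column to a proper integer dtype.
flat = df.explode(["words", "ner_tags"], ignore_index=True)
flat["ner_tags"] = flat["ner_tags"].astype("int64")

# flat now holds only str/int scalars, which tfdv should be able to
# profile per token:
#   import tensorflow_data_validation as tfdv
#   stats = tfdv.generate_statistics_from_dataframe(flat)
```

Note that this changes the unit of analysis from sentence to token, and for doubly nested columns such as bboxes (list(list(int))) one explode still leaves list(int) cells, so you would have to explode again or stringify the inner lists.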

More generally: has anyone done data validation on a NER 🤗 dataset? Using which validation framework (such as the 🤗 Data-Measurements-Tool, tfdv, or Great Expectations)?