I want to validate my dataset that contains features of type `str`, `list(str)`, and `list(list(int))` (for NER).
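To make the shape of the data concrete, here is a hypothetical example row with those three feature types (the column names `words` and `bboxes` match the columns I mention below; the `id` field and all values are just an illustration):

```python
# Hypothetical example row; values are made up for illustration.
example = {
    "id": "doc-0001",                                  # str
    "words": ["John", "lives", "in", "Berlin"],        # list(str)
    "bboxes": [[10, 12, 40, 28], [45, 12, 80, 28],
               [85, 12, 99, 28], [104, 12, 150, 28]],  # list(list(int))
}
```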
Unfortunately, the Data-Measurements-Tool does not yet support cases with multiple label columns (see Handle the case where there are multiple label columns · Issue #36 · huggingface/data-measurements-tool · GitHub).
While I did not find information on whether lists and multiple label columns are supported by https://greatexpectations.io/, it seems TensorFlow Data Validation (tfdv) can handle them.
Unfortunately, tfdv does not yet support generating statistics directly from Arrow files (generate_statistics_from_pyarrow table or parquet · Issue #92 · tensorflow/data-validation · GitHub), so I need to go through CSV, a Pandas DataFrame, or TFRecord. I have difficulties with all three variants:
- CSV dumps a `list(list(int))` as the string `array([...]) array([...]) ...`, which tfdv does not understand.
- `export()` to a TFRecord crashes due to some type or serialization error: Export own dataset with different feature types to TFRecord.
- `to_pandas()` creates a DataFrame whose columns are series of objects, for which `tfdv.generate_statistics_from_dataframe(dataframe)` crashes with `pyarrow.lib.ArrowTypeError: ("Expected bytes, got a 'numpy.ndarray' object", 'Conversion failed for column words with type object')`. Since `dataframe["words"] = dataframe["words"].astype(str)` avoids this error for the column `words`, I guess I need to cast the DataFrame columns. But how do I do this for lists (i.e. for a Pandas Series of Pandas Series)? Is there some way that `to_pandas()` creates a correctly typed DataFrame? Calling `dataset = dataset.cast_column("bboxes", Sequence(feature=Sequence(feature=Value("int64"))))` before `to_pandas()` yields the same DataFrame.
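The only workaround I can think of so far (not a confirmed fix) is to serialize the nested-list columns to JSON strings before handing the DataFrame to tfdv, so that pyarrow sees plain string columns instead of object arrays. `stringify_nested` is a hypothetical helper of mine, and the toy DataFrame below merely stands in for the output of `to_pandas()`:

```python
import json

import pandas as pd

# Toy stand-in for dataset.to_pandas(): object-dtype columns holding lists.
df = pd.DataFrame({
    "words": [["John", "Smith"], ["Berlin"]],
    "bboxes": [[[1, 2, 3, 4], [5, 6, 7, 8]], [[9, 10, 11, 12]]],
})

def stringify_nested(frame, columns):
    """Serialize nested-list columns to JSON strings so that pyarrow
    converts them as plain strings instead of failing on object arrays.
    (If to_pandas() yields numpy arrays, call .tolist() before dumping.)"""
    out = frame.copy()
    for col in columns:
        out[col] = out[col].apply(json.dumps)
    return out

flat = stringify_nested(df, ["words", "bboxes"])
# flat now has string columns, so tfdv.generate_statistics_from_dataframe(flat)
# should no longer hit the ArrowTypeError -- but the statistics would then be
# computed over opaque strings, not over the individual tokens or integers.
```

This sidesteps the conversion error rather than solving the typing problem, which is why I am asking whether `to_pandas()` can produce correctly typed columns in the first place.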
More generally: has anyone done data validation on a NER dataset? Using which validation framework (e.g. Data-Measurements-Tool, tfdv, Great Expectations)?