I have a bunch of long texts in a dataset. I want to write a map function that splits these long samples into multiple shorter samples. Can this be done with Datasets? I saw some suggestions about returning a list of row dictionaries. I tried this and it did not work. I also tried returning a single dict with lists of what should go in each column. I get errors out of pyarrow either way. Any suggestions about how I should go about doing this? Thanks.
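This is possible in the batched `map` mode (`batched=True`), as explained here. In that mode the function receives a dict of column lists and may return a different number of rows than it was given. Note that `map` requires all the columns in the returned batch to match in length, so either pass `remove_columns=dataset.column_names` or transform the rest of the columns to make them equal in size to avoid an error.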
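For illustration, here is a minimal sketch of that approach. The column name `text`, the `max_chars` limit, and the helper names `split_long_texts` and `split_dataset` are assumptions made up for this example, not part of the library. The sketch splits on a fixed character count and repeats the values of the other columns for every chunk so that all returned columns have the same length.

```python
def split_long_texts(batch, text_column="text", max_chars=100):
    """Split each text in a batch into chunks of at most max_chars characters.

    `batch` is a dict mapping column names to equal-length lists, which is
    the shape Dataset.map passes to the function when batched=True.
    """
    names = list(batch.keys())
    out = {name: [] for name in names}
    for i, text in enumerate(batch[text_column]):
        # Fixed-size character chunks; an empty text yields no chunks,
        # so that row is dropped from the output.
        chunks = [text[s:s + max_chars] for s in range(0, len(text), max_chars)]
        for chunk in chunks:
            for name in names:
                # Repeat the other columns' values once per chunk so every
                # returned column ends up with the same length.
                out[name].append(chunk if name == text_column else batch[name][i])
    return out


def split_dataset(dataset, text_column="text", max_chars=100):
    """Apply split_long_texts to a datasets.Dataset in batched mode."""
    return dataset.map(
        split_long_texts,
        batched=True,
        # Drop the original columns; only the returned ones are kept.
        remove_columns=dataset.column_names,
        fn_kwargs={"text_column": text_column, "max_chars": max_chars},
    )


# Usage (requires the `datasets` package):
#   from datasets import Dataset
#   ds = Dataset.from_dict({"text": ["a" * 250, "short"], "label": [0, 1]})
#   short = split_dataset(ds, max_chars=100)
```

If you only need the text, returning just `{"text": chunks}` together with `remove_columns=dataset.column_names` should work as well; repeating the other columns is only needed when you want to keep them.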