I have a bunch of long texts in a dataset. I want to write a map function that splits these long samples into multiple shorter samples. Can this be done with Datasets? I saw some posts about returning a list of row dictionaries. I tried this and it did not work. I also tried returning a single dict with lists of what should go in each column. I get errors out of pyarrow either way. Any suggestions about how I should go about doing this? Thanks
This is possible in the batched map mode, as explained here. Note that map requires all the columns in the returned batch to match in length, so either pass remove_columns=dataset.column_names or transform the rest of the columns to make them equal in size to avoid an error.
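
Here is a minimal sketch of that approach. The column names (id, text, source_id), the word-based chunking, and the max_words parameter are assumptions made up for illustration, so adapt them to your own data. The function returns a single dict of equal-length lists, which is what the batched mode expects.

```python
def split_long_texts(batch, max_words=3):
    """Split each text in a batch into chunks of at most max_words words.

    `batch` is a dict of lists (one list per column), which is how
    map(batched=True) passes data. The "id" and "text" column names
    are hypothetical.
    """
    chunks, source_ids = [], []
    for row_id, text in zip(batch["id"], batch["text"]):
        words = text.split()
        for start in range(0, len(words), max_words):
            chunks.append(" ".join(words[start:start + max_words]))
            # Repeat the original id so both returned columns have the same length.
            source_ids.append(row_id)
    return {"text": chunks, "source_id": source_ids}


def demo():
    # Imported here so the splitting function above has no dependency on datasets.
    from datasets import Dataset

    ds = Dataset.from_dict({"id": [0, 1], "text": ["a b c d e", "f g"]})
    # remove_columns drops the original columns, whose length no longer
    # matches the number of rows returned by the function.
    return ds.map(split_long_texts, batched=True, remove_columns=ds.column_names)
```

Calling demo() should give a dataset with three rows instead of two. If you want to keep other columns, repeat their values once per chunk inside the function, the way source_id is handled here, rather than relying on the originals.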