I have a dataset of texts that I want to split into shorter texts

amazingvince · October 15, 2023, 4:20pm

I have a lot of long texts in a dataset. I want to write a map function that splits each long sample into multiple shorter samples. Can this be done with Datasets? I have seen suggestions to return a list of row dictionaries, but that did not work when I tried it. I also tried returning a single dict whose values are lists of what should go in each column. Either way I get errors from pyarrow. Any suggestions on how I should go about this? Thanks.

mariosasko · October 16, 2023, 1:58pm

This is possible in batched map mode (batched=True), as explained here. Note that map requires all columns in the returned batch to have the same length, so either pass remove_columns=dataset.column_names or transform the remaining columns so they match the new length; otherwise you will get an error.
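As a rough illustration of the batched approach described above, here is a minimal sketch. It assumes the dataset has a single text column named "text" and that cutting by a fixed number of characters is acceptable; the helper names split_text, split_batch and split_dataset, and the max_chars value, are made up for this example and are not part of the library.

```python
def split_text(text, max_chars=1000):
    """Cut one string into consecutive pieces of at most max_chars characters."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def split_batch(batch, max_chars=1000):
    """Batched map function: one input row may become several output rows.

    `batch` is a dict mapping column names to lists of values. The "text"
    column name is an assumption about the dataset.
    """
    chunks = []
    for text in batch["text"]:
        chunks.extend(split_text(text, max_chars))
    # Only the new column is returned, so no column lengths can disagree.
    return {"text": chunks}


def split_dataset(dataset, max_chars=1000):
    """Apply split_batch to a datasets.Dataset in batched mode."""
    return dataset.map(
        split_batch,
        batched=True,
        # Drop the original columns so they cannot clash in length
        # with the (longer) list of chunks returned by split_batch.
        remove_columns=dataset.column_names,
        fn_kwargs={"max_chars": max_chars},
    )

# Example (requires the `datasets` package):
#   from datasets import Dataset
#   ds = Dataset.from_dict({"text": ["a" * 2500, "b" * 300]})
#   short = split_dataset(ds)   # 4 rows with lengths 1000, 1000, 500, 300
```

Note that an empty string produces no chunks in this sketch, so such rows disappear from the output. If other columns (an id, a label) need to be kept, they would have to be repeated once per chunk inside split_batch so that every returned list has the same length.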
