How can I add a new column using only a streamed remote dataset?

I recently built a speech dataset in WebDataset format and uploaded it to the HF Hub. However, adding a new column to existing tar files is hard, so I decided to recreate the whole dataset in a format that makes adding columns easier.

My main concern is that I don't have enough storage, so I don't want to download the whole dataset just to add a new column. Is this possible with a Parquet-based dataset on the HF Hub, i.e. adding a column using only streaming data loading?


Yup, you can even merge two datasets with different columns together if it’s easier for you

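from datasets import concatenate_datasets, load_dataset

# Option 1: add the column from a list with one value per row
ds = ds.add_column("new_col", my_list)
# OR
# Option 2: load a dataset that holds the new column and concatenate column-wise
other_ds_with_new_col = load_dataset(...)
ds = concatenate_datasets([ds, other_ds_with_new_col], axis=1)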

@lhoestq Thanks! Adding a column works as expected.
One more question: is it possible to push the dataset with the added column to the Hub without dumping all the Parquet files to local storage? Also, IterableDataset does not have a push_to_hub method.

from datasets import load_dataset

dataset = load_dataset("...", streaming=True)  # large dataset
new_column_values = "..."
dataset = dataset.add_column("new_col", new_column_values)

dataset.push_to_hub("...")  # error, IterableDataset has no push_to_hub

I think I could push just the new column as a separate dataset, in the same row order as the original, and then combine the two with concatenate_datasets. But if there is a way to push_to_hub a concatenated iterable dataset, that would be best.
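
The row-order idea can be sketched in plain Python, without any Hub access. This is only an illustration under stated assumptions: the two generators below are stand-ins for the streamed original dataset and the streamed new-column dataset, and `merge_rows` and `batched` are hypothetical helper names, not part of `datasets`. It pairs rows by position and groups them into small shards, so only one shard is held at a time.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def merge_rows(base: Iterable[Dict], extra: Iterable[Dict]) -> Iterator[Dict]:
    """Pair rows by position; both streams must yield rows in the same order."""
    # Note: zip stops at the shorter stream, so a length mismatch is silent here.
    for base_row, extra_row in zip(base, extra):
        yield {**base_row, **extra_row}


def batched(rows: Iterable[Dict], shard_size: int) -> Iterator[List[Dict]]:
    """Group a row stream into lists of at most `shard_size` rows."""
    it = iter(rows)
    while True:
        shard = list(islice(it, shard_size))
        if not shard:
            return
        yield shard


# Stand-ins for the two streamed datasets (hypothetical data).
base_stream = ({"id": i, "text": f"utt-{i}"} for i in range(5))
new_col_stream = ({"new_col": i * 10} for i in range(5))

shard_sizes = []
first_row = None
for shard in batched(merge_rows(base_stream, new_col_stream), shard_size=2):
    # In a real pipeline, each shard could be written to a temporary
    # Parquet file, uploaded, and deleted before the next one is built.
    if first_row is None:
        first_row = shard[0]
    shard_sizes.append(len(shard))

print(shard_sizes, first_row)
```

Because the pairing is positional, this only works if both sources are read in exactly the same order, with no shuffling on either side.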

