Shuffle a Single Feature (column) in a Dataset

maxim-fl · December 29, 2021, 6:48am

Hi,
I am learning the dataset API. The shuffle API documentation states that it rearranges the values of a column, but in my experiments it shuffles the rows.

The code documentation is clearer and states that the rows are shuffled.

To achieve a column shuffle, I used the map functionality (batched=True) and created the following mapper function:

from random import Random

def _shuffle_question_column_batch(examples):
    # Shuffle the values of the "question" column within the batch (fixed seed).
    questions = examples["question"]
    Random(42).shuffle(questions)
    examples["question"] = questions
    return examples

I am wondering whether the shuffle API is capable of rearranging values of a single column or a better way exists.

Please advise.
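For reference, the mapper idea above can be sketched on a plain dictionary of lists, which is the shape a batched map function receives. This is only an illustration: the helper name `shuffle_column_batch`, its `column` and `seed` parameters, and the toy batch are assumptions, not part of the datasets library.

```python
from random import Random


def shuffle_column_batch(examples, column="question", seed=42):
    """Return a copy of the batch with the values of one column permuted.

    `examples` is a dict mapping column names to equal-length lists.
    All other columns keep their original order.
    """
    shuffled = list(examples[column])  # copy so the input batch is not mutated
    Random(seed).shuffle(shuffled)
    return {**examples, column: shuffled}


# Hypothetical toy batch for illustration only.
batch = {
    "question": ["q0", "q1", "q2", "q3"],
    "answer": ["a0", "a1", "a2", "a3"],
}
result = shuffle_column_batch(batch)
```

One caveat if a function like this is passed to `Dataset.map(..., batched=True)`: the default `batch_size` is 1000, so values would only be permuted within each batch of 1000 rows. Passing `batch_size=None` hands the whole dataset to the function as a single batch.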

jon-fernandes · December 29, 2021, 6:49pm

Take the imdb (movie review) dataset as an example: it contains thousands of movie reviews, with one column holding the review text and another holding the label (0 or 1). We wouldn't want to shuffle the columns, since that would only swap the text and the label, and there is no benefit to that. We care about shuffling the rows, and this is what the shuffle method does.
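To make the row-shuffling behaviour concrete, here is a minimal sketch in plain Python. It is not how the datasets library implements shuffle; the helper name `shuffle_rows` and the toy reviews are made up for illustration. The point is that one permutation is applied to every column, so each text stays paired with its label.

```python
from random import Random

# Hypothetical toy data in columnar form (column name -> list of values).
columns = {
    "text": ["great film", "dull plot", "loved it", "fell asleep"],
    "label": [1, 0, 1, 0],
}


def shuffle_rows(columns, seed=0):
    """Apply the same random permutation to every column."""
    n_rows = len(next(iter(columns.values())))
    perm = list(range(n_rows))
    Random(seed).shuffle(perm)
    return {name: [values[i] for i in perm] for name, values in columns.items()}


shuffled = shuffle_rows(columns)

# The (text, label) pairs are the same before and after; only their order changes.
pairs_before = set(zip(columns["text"], columns["label"]))
pairs_after = set(zip(shuffled["text"], shuffled["label"]))
```

Shuffling a single column, by contrast, permutes one list on its own and so breaks this pairing, which is what the original question is after.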

maxim-fl · December 30, 2021, 9:28am

Meaning that the documentation is wrong.

mariosasko · January 3, 2022, 11:48am

Hi! Yes, shuffle rearranges the rows of a Dataset. I agree this part of the docs is not very clear (cc @stevhliu). And if you are interested in shuffling a single column, you can do the following:

import datasets
single_col = "<col_name>"  # specify the column name here
dset_single_col = dset.remove_columns([col for col in dset.column_names if col != single_col])
dset_single_col_shuffled = dset_single_col.shuffle()
dset_without_single_col = dset.remove_columns([single_col])
dset = datasets.concatenate_datasets([dset_without_single_col, dset_single_col_shuffled], axis=1)
