Shuffle a Single Feature (column) in a Dataset

Hi,
I am learning the Dataset API. The shuffle API documentation states that it rearranges the values of a column, but in my experiments it shuffles the rows.

The code documentation is clearer and states that the rows are shuffled.

To shuffle a single column, I used map (with batched=True) and wrote the following mapper function:

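from random import Random

def _shuffle_question_column_batch(examples):
    # With batched=True, map passes one batch at a time (1000 rows by default),
    # so the values are shuffled within each batch only.
    questions = examples["question"]
    Random(42).shuffle(questions)
    examples["question"] = questions
    return examples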

I am wondering whether the shuffle API can rearrange the values of a single column, or whether a better way exists.

Please advise.

Take the imdb movie review dataset as an example: it contains thousands of movie reviews, with one column holding the review text and another holding the label (0 or 1). We wouldn't want to shuffle the columns, since that would only swap the text and the label, and there is no benefit to that. We care about shuffling the rows, and this is what the shuffle method does.
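
To make the distinction concrete, here is a minimal pure-Python sketch (no `datasets` dependency) contrasting a row shuffle with a shuffle of the values inside one column, which is what the original question asks about. The toy table and the helper names `shuffle_rows` and `shuffle_one_column` are made up for illustration and are not part of any library.

```python
import random

# Toy stand-in for a two-column dataset (hypothetical data, not real imdb rows).
table = {
    "text": ["great film", "terrible plot", "loved it", "fell asleep"],
    "label": [1, 0, 1, 0],
}

def shuffle_rows(columns, seed):
    """Permute whole rows: every column is reordered with the same permutation."""
    n = len(next(iter(columns.values())))
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return {name: [values[i] for i in order] for name, values in columns.items()}

def shuffle_one_column(columns, name, seed):
    """Permute the values of a single column and leave the other columns in place."""
    shuffled = list(columns[name])  # copy so the input is not mutated
    random.Random(seed).shuffle(shuffled)
    return {**columns, name: shuffled}

rows_shuffled = shuffle_rows(table, seed=42)
col_shuffled = shuffle_one_column(table, "text", seed=42)
```

After a row shuffle each text still sits next to its own label. After a single-column shuffle the labels stay where they were and the texts move, so the text/label pairing is generally broken.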

So the documentation is wrong.

Hi! Yes, shuffle rearranges the rows of a Dataset. I agree this part of the docs is not very clear (cc @stevhliu). And if you are interested in shuffling a single column, you can do the following:

import datasets

single_col = "<col_name>"  # replace with the name of the column to shuffle
# Keep only the target column and shuffle its rows
dset_single_col = dset.remove_columns([col for col in dset.column_names if col != single_col])
dset_single_col_shuffled = dset_single_col.shuffle()
# Drop the target column from the original, then join the two datasets column-wise
dset_without_single_col = dset.remove_columns([single_col])
dset = datasets.concatenate_datasets([dset_without_single_col, dset_single_col_shuffled], axis=1)
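
As an alternative sketch, the map-based approach from the question should also work across the whole dataset if the entire column is passed as one batch. `Dataset.map` accepts `batch_size=None` together with `batched=True` to provide the full dataset as a single batch. The factory name `make_column_shuffler` is hypothetical, and the snippet assumes the column fits in memory; the check below runs on a plain dict shaped like a batch rather than on a real Dataset.

```python
from random import Random

def make_column_shuffler(column, seed=42):
    """Build a batched mapper that shuffles one column of the batch it receives."""
    def _shuffle(batch):
        values = list(batch[column])  # copy so the incoming batch is not mutated
        Random(seed).shuffle(values)
        # Returning only this column lets map() overwrite it and keep the rest.
        return {column: values}
    return _shuffle

# Quick check on a plain dict that mimics a batch (hypothetical data).
batch = {"question": ["q1", "q2", "q3", "q4"], "answer": ["a1", "a2", "a3", "a4"]}
shuffled = make_column_shuffler("question")(batch)

# With a real Dataset (assumed here to be named dset), the call would look like:
# dset = dset.map(make_column_shuffler("question"), batched=True, batch_size=None)
```

With the default batch size the same mapper would only shuffle values within each batch of 1000 rows, which is why `batch_size=None` matters here.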