datasets.Dataset.sort() does not preserve ordering

roccofortuna · January 6, 2023, 12:11pm

It looks like `sort()` does not preserve ordering (that is, it is not a stable sort: rows that compare equal do not keep their existing relative order), and it supports neither sorting on multiple columns nor a key function.
Preserved ordering is very handy when one needs to sort on multiple columns, A and B, so that whenever A is equal for two or more rows, those rows stay sorted by B.

In native Python, besides using a key function that encodes both criteria, one could achieve this by sorting by B first and then by A, since Python's built-in sort is stable.
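For illustration, here is a minimal pure-Python sketch of the two approaches mentioned above. The `rows` data and the column names `A` and `B` are made up for the example; it only relies on the fact that Python's built-in `sorted()` is stable.

```python
# Toy rows with two columns, A and B (made-up data for illustration).
rows = [
    {"A": 2, "B": "x"},
    {"A": 1, "B": "z"},
    {"A": 2, "B": "a"},
    {"A": 1, "B": "b"},
]

# Approach 1: two stable passes, secondary key (B) first, then primary key (A).
by_b = sorted(rows, key=lambda r: r["B"])
two_pass = sorted(by_b, key=lambda r: r["A"])

# Approach 2: a single pass with a key function returning a tuple (A, B).
one_pass = sorted(rows, key=lambda r: (r["A"], r["B"]))

# Both approaches give the same order: sorted by A, ties broken by B.
print([(r["A"], r["B"]) for r in two_pass])
```

The two-pass version only works because the second sort keeps rows with equal A in the order the first sort left them.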

What’s the recommended way to achieve this with Datasets?

lhoestq · January 13, 2023, 4:37pm

Hi! I don’t think this case is supported right now, but feel free to open an issue on GitHub in case someone would like to contribute this feature. We can imagine having something similar to pandas and be able to specify multiple columns for sorting. We’re already using pandas under the hood to do the sorting in datasets.

In the meantime, feel free to convert your dataset to pandas and use `df.sort_values`!
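A rough sketch of that workaround, in case it helps. The helper names `sort_frame` and `sort_dataset` and the toy columns `A` and `B` are hypothetical, not part of the library; the sketch assumes `pandas` and `datasets` are installed and that the dataset fits in memory, since `to_pandas()` materializes it as a DataFrame.

```python
import pandas as pd


def sort_frame(df, columns):
    # Sort by several columns with decreasing priority, then renumber the rows.
    return df.sort_values(by=columns).reset_index(drop=True)


def sort_dataset(ds, columns):
    # Hypothetical helper: round-trip a datasets.Dataset through pandas.
    from datasets import Dataset  # imported here so the pandas part runs on its own

    df = sort_frame(ds.to_pandas(), columns)
    # preserve_index=False avoids storing the pandas index as an extra column.
    return Dataset.from_pandas(df, preserve_index=False)


# Toy data standing in for ds.to_pandas() (made-up values).
df = pd.DataFrame({"A": [2, 1, 2, 1], "B": ["x", "z", "a", "b"]})
sorted_df = sort_frame(df, ["A", "B"])
print(sorted_df)
```

With a real dataset, one would call something like `sort_dataset(ds, ["A", "B"])` and get back a new `Dataset` sorted by A, with ties broken by B.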

roccofortuna · January 16, 2023, 9:31am

Sounds good! Thanks.

github.com/huggingface/datasets

Add preserve ordering param in datasets.Dataset.sort()

opened 09:22AM - 16 Jan 23 UTC

rocco-fortuna

enhancement

### Feature request

From discussion on forum: https://discuss.huggingface.co/…t/datasets-dataset-sort-does-not-preserve-ordering/29065/1

`sort()` does not preserve ordering, and it does not support sorting on multiple columns, nor a key function.

The suggested solution:

> ... having something similar to pandas and be able to specify multiple columns for sorting. We’re already using pandas under the hood to do the sorting in datasets.

The suggested workaround:

> convert your dataset to pandas and use `df.sort_values()`

### Motivation

Preserved ordering when sorting is very handy when one needs to sort on multiple columns, A and B, so that e.g. whenever A is equal for two or more rows, B is kept sorted. Having a parameter to do this in 🤗datasets would be cleaner than going through pandas and back, and it wouldn't add much complexity to the library.

Alternatives:

- the possibility to specify multiple keys to sort by with decreasing priority,
- the ability to provide a key function for sorting, so that one can manually specify the sorting criteria.

### Your contribution

I'll be happy to contribute by submitting a PR. Will get documented on `CONTRIBUTING.MD`.

Would love to get thoughts on this, if anyone has anything to add.
