Using datasets with sequences of different lengths under one index

I have the following situation:

  • Millions of users with unique ids.
  • Each user exchanges some number of messages with a system.
  • Each message has some properties, for example, the timestamp when the message was sent / received.

Inputs to the model are whole conversations of individual users rather than individual messages, i.e., sequences of messages. So it seems logical that when I query the dataset, the key should be user_id and the value should be a list / array of messages.
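
To make the access pattern concrete, here is a toy sketch (the user ids and field names like `timestamp` and `text` are just placeholders):

```python
# Hypothetical illustration of the access pattern I have in mind.
conversations = {
    "user_00042": [  # key: user id, value: that user's whole conversation
        {"timestamp": 1_690_000_000, "text": "hello"},
        {"timestamp": 1_690_000_060, "text": "hi, how can I help?"},
    ],
    "user_00043": [  # conversations have different lengths
        {"timestamp": 1_690_100_000, "text": "reset my password"},
    ],
}

messages = conversations["user_00042"]  # one sample = one whole conversation
```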

Since the dataset is very large, I have usually stored this in HDF5 files, with keys and values as described above. That worked, sort of, in combination with PyTorch's Dataset / DataLoader, but I'm looking for a better solution than HDF5 files and the h5py library.
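
Roughly the kind of setup I mean (a minimal sketch, not my exact code; the layout with one HDF5 group per user and the column names are illustrative):

```python
import h5py
import torch
from torch.utils.data import Dataset

class ConversationDataset(Dataset):
    """HDF5-backed dataset: one group per user, variable-length columns."""

    def __init__(self, path):
        self.file = h5py.File(path, "r")
        self.user_ids = list(self.file.keys())  # one HDF5 group per user

    def __len__(self):
        return len(self.user_ids)

    def __getitem__(self, idx):
        group = self.file[self.user_ids[idx]]
        # Each column is a variable-length 1-D array for this user.
        timestamps = torch.as_tensor(group["timestamps"][:])
        tokens = torch.as_tensor(group["tokens"][:])
        return {"timestamps": timestamps, "tokens": tokens}
```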

One logical way to do this with datasets would be to create a DatasetDict containing one Dataset instance per conversation. That actually worked as well, on a small sample, but I'm not sure it scales. I also feel like I'm misusing DatasetDict, since in the examples it's usually used for splits. Moreover, when saving a DatasetDict you end up with one subdirectory per Dataset, which would be pretty bad at the scale of millions of users.
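
Concretely, I mean something like this (toy data, made-up column names):

```python
from datasets import Dataset, DatasetDict

# One Dataset per user, keyed by user id; abuses DatasetDict's split keys.
conversations = DatasetDict({
    "user_00042": Dataset.from_dict({
        "timestamp": [1_690_000_000, 1_690_000_060],
        "text": ["hello", "hi, how can I help?"],
    }),
    "user_00043": Dataset.from_dict({
        "timestamp": [1_690_100_000],
        "text": ["reset my password"],
    }),
})

# Writes one subdirectory per key, i.e. millions of directories at my scale.
conversations.save_to_disk("conversations")
```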

What would you recommend in this situation? I can imagine there are a number of problems like this, where individual samples are sequences of different lengths.
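
As one concrete example of the friction: batching variable-length samples already needs a custom collate_fn, along these lines (a sketch assuming each sample carries a 1-D `tokens` tensor, as in the HDF5 example above):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_conversations(batch):
    """Pad variable-length conversations to the longest one in the batch."""
    tokens = [sample["tokens"] for sample in batch]
    lengths = torch.tensor([len(t) for t in tokens])
    padded = pad_sequence(tokens, batch_first=True, padding_value=0)
    return {"tokens": padded, "lengths": lengths}
```

This would then be passed to the DataLoader as `DataLoader(dataset, batch_size=..., collate_fn=collate_conversations)`.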