Using datasets with sequences of different lengths under one index

I have the following situation:

  • Millions of users with unique ids.
  • Each user exchanges some number of messages with a system.
  • Each message has some properties, for example, the timestamp when the message was sent / received.

Inputs to the model are whole conversations of individual users rather than individual messages, i.e., sequences of messages. So it seems logical that when I query the dataset, the key should be user_id and the value should be a list / array of messages.
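
To make the access pattern concrete, here is a toy sketch (the user ids and field names like `timestamp` and `text` are just placeholders):

```python
# Hypothetical illustration of the access pattern I have in mind.
conversations = {
    "user_00042": [  # key: user id, value: that user's whole conversation
        {"timestamp": 1_690_000_000, "text": "hello"},
        {"timestamp": 1_690_000_060, "text": "hi, how can I help?"},
    ],
    "user_00043": [  # conversations have different lengths
        {"timestamp": 1_690_100_000, "text": "reset my password"},
    ],
}

messages = conversations["user_00042"]  # one sample = one whole conversation
```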

Since the dataset is very large, I have usually stored this in HDF5 files, with keys and values as described above. That worked, sort of, in combination with PyTorch's Dataset / DataLoader, but I'm looking for a better solution than HDF5 files and the h5py library.
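
Roughly the kind of setup I mean (a minimal sketch, not my exact code; the layout with one HDF5 group per user and the column names are illustrative):

```python
import h5py
import torch
from torch.utils.data import Dataset

class ConversationDataset(Dataset):
    """HDF5-backed dataset: one group per user, variable-length columns."""

    def __init__(self, path):
        self.file = h5py.File(path, "r")
        self.user_ids = list(self.file.keys())  # one HDF5 group per user

    def __len__(self):
        return len(self.user_ids)

    def __getitem__(self, idx):
        group = self.file[self.user_ids[idx]]
        # Each column is a variable-length 1-D array for this user.
        timestamps = torch.as_tensor(group["timestamps"][:])
        tokens = torch.as_tensor(group["tokens"][:])
        return {"timestamps": timestamps, "tokens": tokens}
```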

One logical way to do this with datasets would be to create a DatasetDict containing one Dataset instance per conversation. That actually worked as well, on a small sample, but I'm not sure it scales. I also feel like I'm misusing DatasetDict, since in the examples it's usually used for splits. Moreover, when saving a DatasetDict you end up with one subdirectory per Dataset, which would be pretty bad at the scale of millions of users.
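
Concretely, I mean something like this (toy data, made-up column names):

```python
from datasets import Dataset, DatasetDict

# One Dataset per user, keyed by user id; abuses DatasetDict's split keys.
conversations = DatasetDict({
    "user_00042": Dataset.from_dict({
        "timestamp": [1_690_000_000, 1_690_000_060],
        "text": ["hello", "hi, how can I help?"],
    }),
    "user_00043": Dataset.from_dict({
        "timestamp": [1_690_100_000],
        "text": ["reset my password"],
    }),
})

# Writes one subdirectory per key, i.e. millions of directories at my scale.
conversations.save_to_disk("conversations")
```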

What would you recommend in this situation? I can imagine there are a number of problems like this, where individual samples are sequences of different lengths.
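
As one concrete example of the friction: batching variable-length samples already needs a custom collate_fn, along these lines (a sketch assuming each sample carries a 1-D `tokens` tensor, as in the HDF5 example above):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_conversations(batch):
    """Pad variable-length conversations to the longest one in the batch."""
    tokens = [sample["tokens"] for sample in batch]
    lengths = torch.tensor([len(t) for t in tokens])
    padded = pad_sequence(tokens, batch_first=True, padding_value=0)
    return {"tokens": padded, "lengths": lengths}
```

This would then be passed to the DataLoader as `DataLoader(dataset, batch_size=..., collate_fn=collate_conversations)`.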