Advice on loading Tabular Data for Sequence Modelling

Hello all,

I am working on a project where some tabular data can be viewed as a sequence of events, and I want to model it with an RNN.

For example, let us assume the following dataset:

import numpy as np
import pandas as pd

dataset_len = 13
df = pd.DataFrame({
    "ID": pd.Series([1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3]),
    "order_column": pd.Series([1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 4, 5]),
    "feature1": pd.Series(np.random.choice(["a", "b", "c", pd.NA], size=dataset_len)),
    "feature2": pd.Series(np.random.choice([True, False, pd.NA], size=dataset_len)),
    "feature3": pd.Series(np.random.choice([3.14, 2.72, 9.8, pd.NA], size=dataset_len)),
})

The dataset looks as follows:
[image: the dataframe rendered as a table]

Each sample is a sequence (ordered by order_column) with features feature1, feature2, and feature3.
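
In other words, each sequence is what you get by grouping on ID after sorting by order_column (just an illustration on the toy dataframe above):

# Each group is one training sequence, ordered by order_column.
for sample_id, group in df.sort_values(["ID", "order_column"]).groupby("ID"):
    print(sample_id, group[["feature1", "feature2", "feature3"]].to_dict("records"))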

Loading the dataset using HF datasets was simple:

from datasets import Features, Value, load_dataset

features = Features({
    "ID": Value("uint8"),
    "order_column": Value("uint8"),
    "feature1": Value("string"),
    "feature2": Value("bool"),
    "feature3": Value("float32"),
})
dataset = load_dataset(
    "parquet",
    data_files=[(OUTPUT_PATH / "sequence.parquet").as_posix()],
    features=features,
    split="train",
)
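
(Here OUTPUT_PATH is a pathlib.Path; for completeness, the parquet file was produced from the dataframe beforehand with something like the following, where the directory name is just a placeholder.)

from pathlib import Path

OUTPUT_PATH = Path("data")  # placeholder directory
OUTPUT_PATH.mkdir(parents=True, exist_ok=True)
df.to_parquet(OUTPUT_PATH / "sequence.parquet", index=False)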

The issue at hand is being able to iterate over the unique IDs (i.e. dataset.unique("ID")) and retrieve the sequence of data points for each ID in order to train an RNN.

Something like the following does not work well, because the dataset is enormous and the real data is not sorted in the first place (my toy example above is sorted only for easier explanation):

from torch.utils.data import Dataset

class SequenceDataset(Dataset):
    def __init__(self, hf_dataset):
        self.hf_dataset = hf_dataset
        self.unique_ids = self.hf_dataset.unique("ID")

    def __len__(self):
        return len(self.unique_ids)

    def __getitem__(self, idx):
        # Scans the whole dataset on every access, which is far too slow.
        return self.hf_dataset.filter(lambda sample: sample["ID"] == self.unique_ids[idx])[:]

On top of that, the output is not in the desired format either:

{
    'ID': [1, 1, 1, 1, 1],
    'order_column': [1, 2, 3, 4, 5],
    'feature1': ['a', 'c', None, 'a', 'c'],
    'feature2': [None, False, None, False, True],
    'feature3': [3.140000104904175, None, 2.7200000286102295, 3.140000104904175, 2.7200000286102295]
}

I would like each row to be an input to the model at each time step, i.e. something like a transpose of what I am seeing here.
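
To make the desired shape concrete, this is the kind of "transpose" I mean (a minimal sketch on the batch above):

batch = {
    "ID": [1, 1, 1, 1, 1],
    "order_column": [1, 2, 3, 4, 5],
    "feature1": ["a", "c", None, "a", "c"],
}

# One dict per time step instead of one list per column.
rows = [dict(zip(batch.keys(), values)) for values in zip(*batch.values())]
# rows[0] -> {"ID": 1, "order_column": 1, "feature1": "a"}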

Hence I am seeking advice on how to approach the problem at hand.

Thanks in advance.

maybe dataset.filter(...).to_pandas().to_numpy() ?
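
i.e. something along these lines (just a sketch; the sort is there because the data is not ordered):

sequence = (
    dataset
    .filter(lambda sample: sample["ID"] == 1)
    .to_pandas()
    .sort_values("order_column")
    .to_numpy()
)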

Hello @lhoestq ,

Apologies for the delayed response. My dataset has north of 10M rows, and filtering for a single ID takes about 1225.5 seconds.

At that level of data-retrieval latency, training a model will not be feasible, I suppose.

Then you'd better build a look-up table so you can retrieve the samples quickly (using a Python dictionary, for example, if it fits in memory).
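
A rough sketch of what I mean, grouping the rows by ID in a single pass (only viable if the grouped data fits in memory):

from collections import defaultdict

# One pass over the dataset, grouping the rows by ID.
id_to_sequence = defaultdict(list)
for sample in dataset:
    id_to_sequence[sample["ID"]].append(sample)

# Each entry is then the list of time steps for one ID,
# ready to be ordered and fed to the RNN.
sequence = sorted(id_to_sequence[1], key=lambda row: row["order_column"])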

Unfortunately, the dataset shape is (10M+, 98), so I cannot fit it into memory. That is why I turned to HF datasets in the first place, since they don't load the data into memory.

It seems like this is not possible with the currently available functionality.

Thanks a lot for the quick response.