Advice on loading Tabular Data for Sequence Modelling

Hello all,

I am working on a project where some tabular data can be viewed as a sequence of events, and I want to model it with an RNN.

For example, let us assume the following dataset:

import numpy as np
import pandas as pd

dataset_len = 13
df = pd.DataFrame({
    "ID": pd.Series([1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3]),
    "order_column": pd.Series([1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 4, 5]),
    "feature1": pd.Series(np.random.choice(["a", "b", "c", pd.NA], size=dataset_len)),
    "feature2": pd.Series(np.random.choice([True, False, pd.NA], size=dataset_len)),
    "feature3": pd.Series(np.random.choice([3.14, 2.72, 9.8, pd.NA], size=dataset_len)),
})

The dataset looks as follows:
[image: the dataframe rendered as a table]

Each sample is a sequence (ordered by order_column) with features feature1, feature2, and feature3.
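
In other words, each sequence is what you get by grouping on ID after sorting by order_column (just an illustration on the toy dataframe above):

# Each group is one training sequence, ordered by order_column.
for sample_id, group in df.sort_values(["ID", "order_column"]).groupby("ID"):
    print(sample_id, group[["feature1", "feature2", "feature3"]].to_dict("records"))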

Loading the dataset using HF datasets was simple:

from datasets import Features, Value, load_dataset

features = Features({
    "ID": Value("uint8"),
    "order_column": Value("uint8"),
    "feature1": Value("string"),
    "feature2": Value("bool"),
    "feature3": Value("float32"),
})
dataset = load_dataset(
    "parquet",
    data_files=[(OUTPUT_PATH / "sequence.parquet").as_posix()],
    features=features,
    split="train",
)
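
(Here OUTPUT_PATH is a pathlib.Path; for completeness, the parquet file was produced from the dataframe beforehand with something like the following, where the directory name is just a placeholder.)

from pathlib import Path

OUTPUT_PATH = Path("data")  # placeholder directory
OUTPUT_PATH.mkdir(parents=True, exist_ok=True)
df.to_parquet(OUTPUT_PATH / "sequence.parquet", index=False)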

The issue at hand is being able to iterate over the unique IDs (i.e. dataset.unique("ID")) and retrieve the sequence of data points for each ID in order to train an RNN.

Something like the following does not work well, because the dataset is enormous and the real data is not sorted in the first place (my toy example above is sorted only for easier explanation):

from torch.utils.data import Dataset

class SequenceDataset(Dataset):
    def __init__(self, hf_dataset):
        self.hf_dataset = hf_dataset
        self.unique_ids = self.hf_dataset.unique("ID")

    def __len__(self):
        return len(self.unique_ids)

    def __getitem__(self, idx):
        # Scans the whole dataset on every access, which is far too slow.
        return self.hf_dataset.filter(lambda sample: sample["ID"] == self.unique_ids[idx])[:]

On top of that, the output is not in the desired format either:

{
    'ID': [1, 1, 1, 1, 1],
    'order_column': [1, 2, 3, 4, 5],
    'feature1': ['a', 'c', None, 'a', 'c'],
    'feature2': [None, False, None, False, True],
    'feature3': [3.140000104904175, None, 2.7200000286102295, 3.140000104904175, 2.7200000286102295]
}

I would like each row to be an input to the model at each time step, i.e. something like a transpose of what I am seeing here.
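
To make the desired shape concrete, this is the kind of "transpose" I mean (a minimal sketch on the batch above):

batch = {
    "ID": [1, 1, 1, 1, 1],
    "order_column": [1, 2, 3, 4, 5],
    "feature1": ["a", "c", None, "a", "c"],
}

# One dict per time step instead of one list per column.
rows = [dict(zip(batch.keys(), values)) for values in zip(*batch.values())]
# rows[0] -> {"ID": 1, "order_column": 1, "feature1": "a"}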

Hence I am seeking advice on how to approach the problem at hand.

Thanks in advance.

maybe dataset.filter(...).to_pandas().to_numpy() ?
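
i.e. something along these lines (just a sketch; the sort is there because the data is not ordered):

sequence = (
    dataset
    .filter(lambda sample: sample["ID"] == 1)
    .to_pandas()
    .sort_values("order_column")
    .to_numpy()
)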

Hello @lhoestq ,

Apologies for the delayed response. My dataset has north of 10M rows, and filtering for a single ID takes about 1225.5 seconds.

At that level of data-retrieval latency, training a model will not be feasible, I suppose.

Then you'd better build a look-up table so you can retrieve the samples quickly (using a Python dictionary, for example, if it fits in memory).
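
A rough sketch of what I mean, grouping the rows by ID in a single pass (only viable if the grouped data fits in memory):

from collections import defaultdict

# One pass over the dataset, grouping the rows by ID.
id_to_sequence = defaultdict(list)
for sample in dataset:
    id_to_sequence[sample["ID"]].append(sample)

# Each entry is then the list of time steps for one ID,
# ready to be ordered and fed to the RNN.
sequence = sorted(id_to_sequence[1], key=lambda row: row["order_column"])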

Unfortunately, the dataset shape is (10M+, 98), so I cannot fit it into memory. That is why I turned to HF datasets in the first place, since they don't load the data into memory.

It seems like this is not possible with the currently available functionality.

Thanks a lot for the quick response.