How to Process and Cache Multidimensional Text Tensors

Hi all,

I am working on a multi-dimensional text data problem. Specifically, each example is a list of at most N documents, so the number of documents per example can vary. The documents are first processed through BERT and then aggregated with a hierarchical aggregation layer, as in HAN (Hierarchical Attention Networks).

The input is a Parquet file where each row is (string, string, string, list of text). Each batch of the processed dataset is a (B, N, 512) tensor.

Currently I tokenize the documents on the fly, which is not ideal; I would like to tokenize them once and cache the result. Aside from tokenizing and collating in a separate preprocessing step, is there a way to process this raw data into the desired tensor with caching using the datasets package?
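For context, the on-the-fly version is roughly the sketch below (simplified; the "docs" column name and the document cap are placeholders for my actual schema and settings):

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def collate_fn(batch):
    # batch: list of examples, each with a variable-length list of documents
    # ("docs" is a placeholder column name)
    all_ids = []
    for example in batch:
        enc = tokenizer(
            example["docs"],
            padding="max_length",
            truncation=True,
            max_length=512,
            return_tensors="pt",
        )
        all_ids.append(enc["input_ids"])
    # pad the document dimension so the batch stacks to (B, N, 512)
    n_max = max(ids.size(0) for ids in all_ids)
    all_ids = [
        torch.nn.functional.pad(ids, (0, 0, 0, n_max - ids.size(0)))
        for ids in all_ids
    ]
    return torch.stack(all_ids)
```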

Thank you very much in advance!

Hi! Using map on your dataset, you can compute the tensors in advance, and the result is cached automatically.
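A minimal sketch of what that could look like, assuming a "docs" column and a cap of 8 documents per example (both are placeholders for your actual schema):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

MAX_DOCS = 8    # N: assumed cap on documents per example
MAX_LEN = 512   # token length per document

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# "train.parquet" and "docs" are placeholders for your file and column names
ds = load_dataset("parquet", data_files={"train": "train.parquet"})["train"]

def tokenize_docs(example):
    docs = example["docs"][:MAX_DOCS]
    enc = tokenizer(
        docs,
        padding="max_length",
        truncation=True,
        max_length=MAX_LEN,
    )
    # pad the document dimension so every example is (MAX_DOCS, MAX_LEN)
    n_pad = MAX_DOCS - len(docs)
    pad_row = [0] * MAX_LEN
    example["input_ids"] = enc["input_ids"] + [pad_row] * n_pad
    example["attention_mask"] = enc["attention_mask"] + [pad_row] * n_pad
    return example

# map() writes the result to the cache, so later runs reload it instead of re-tokenizing
ds = ds.map(tokenize_docs, num_proc=4)
ds.set_format("torch", columns=["input_ids", "attention_mask"])
```

Batches drawn from this dataset then come out as (B, MAX_DOCS, MAX_LEN) tensors without any tokenization in the collate function.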

You can find some examples of tokenization in the docs here: Process text data