How to Process and Cache Multidimensional Text Tensors

Hi all,

I am working on a multi-dimensional text data problem. Specifically, each example is a list of at most N documents, so the number of documents per example can vary. The documents are first processed through BERT and then aggregated with a hierarchical aggregation layer, as in HAN (Hierarchical Attention Networks).

The input is a Parquet file where each row is (string, string, string, list of text). Each batch of the processed dataset is a (B, N, 512) tensor.

Currently I tokenize the documents on the fly, which is not ideal; I would like to tokenize them once and cache the result. Aside from tokenizing and collating in a separate preprocessing step, is there a way to process this raw data into the desired tensor with caching using the datasets package?
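For context, the on-the-fly version is roughly the sketch below (simplified; the "docs" column name and the document cap are placeholders for my actual schema and settings):

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def collate_fn(batch):
    # batch: list of examples, each with a variable-length list of documents
    # ("docs" is a placeholder column name)
    all_ids = []
    for example in batch:
        enc = tokenizer(
            example["docs"],
            padding="max_length",
            truncation=True,
            max_length=512,
            return_tensors="pt",
        )
        all_ids.append(enc["input_ids"])
    # pad the document dimension so the batch stacks to (B, N, 512)
    n_max = max(ids.size(0) for ids in all_ids)
    all_ids = [
        torch.nn.functional.pad(ids, (0, 0, 0, n_max - ids.size(0)))
        for ids in all_ids
    ]
    return torch.stack(all_ids)
```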

Thank you very much in advance!

Hi! Using map on your dataset, you can compute the tensors in advance, and the result is cached automatically.
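A minimal sketch of what that could look like, assuming a "docs" column and a cap of 8 documents per example (both are placeholders for your actual schema):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

MAX_DOCS = 8    # N: assumed cap on documents per example
MAX_LEN = 512   # token length per document

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# "train.parquet" and "docs" are placeholders for your file and column names
ds = load_dataset("parquet", data_files={"train": "train.parquet"})["train"]

def tokenize_docs(example):
    docs = example["docs"][:MAX_DOCS]
    enc = tokenizer(
        docs,
        padding="max_length",
        truncation=True,
        max_length=MAX_LEN,
    )
    # pad the document dimension so every example is (MAX_DOCS, MAX_LEN)
    n_pad = MAX_DOCS - len(docs)
    pad_row = [0] * MAX_LEN
    example["input_ids"] = enc["input_ids"] + [pad_row] * n_pad
    example["attention_mask"] = enc["attention_mask"] + [pad_row] * n_pad
    return example

# map() writes the result to the cache, so later runs reload it instead of re-tokenizing
ds = ds.map(tokenize_docs, num_proc=4)
ds.set_format("torch", columns=["input_ids", "attention_mask"])
```

Batches drawn from this dataset then come out as (B, MAX_DOCS, MAX_LEN) tensors without any tokenization in the collate function.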

You can find some examples of tokenization in the docs here: Process text data