Intention of the `length` field in class datasets.Sequence?

RmZeta · March 22, 2023, 2:48pm

As far as I know, datasets.Sequence (document) is one of the FieldTypes in Dataset features, which specify the serialization format of the dataset.

Just curious, what is the length field for? It has a default value of -1, which I guess means arbitrary length. Should I explicitly set the length field if I know all my dataset samples have a fixed length? And what is the benefit? Does it improve performance when reading the dataset?

mariosasko · March 23, 2023, 2:02pm

Hi!

When length is specified and not -1, we store a sequence as a fixed-size PyArrow list, which requires less memory than the variable-length version, as the fixed-size one does not store the offsets.
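To make the offsets point concrete, here is a minimal standard-library sketch of the two layouts. It is only an illustration of the idea, not PyArrow's actual implementation: the helper names (`variable_layout`, `fixed_layout`, `nbytes`) are made up for this example, and it ignores details such as validity bitmaps and buffer padding. In PyArrow itself, the two corresponding types are `pa.list_(value_type)` and `pa.list_(value_type, list_size)`.

```python
from array import array


def variable_layout(rows):
    """Variable-length list layout: flat values buffer plus n + 1 offsets."""
    values = array("i")
    offsets = array("i", [0])
    for row in rows:
        values.extend(row)
        offsets.append(len(values))  # end position of this row in `values`
    return values, offsets


def fixed_layout(rows, length):
    """Fixed-size list layout: flat values buffer only, no offsets."""
    values = array("i")
    for row in rows:
        if len(row) != length:
            raise ValueError(f"expected {length} items, got {len(row)}")
        values.extend(row)
    return values


def nbytes(*buffers):
    """Total size in bytes of the given array buffers."""
    return sum(len(buf) * buf.itemsize for buf in buffers)


rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
length = 3

values, offsets = variable_layout(rows)
fixed_values = fixed_layout(rows, length)

variable_bytes = nbytes(values, offsets)
fixed_bytes = nbytes(fixed_values)

# Row i is values[offsets[i]:offsets[i + 1]] in the variable layout,
# and values[i * length:(i + 1) * length] in the fixed layout.
second_row_variable = list(values[offsets[1]:offsets[2]])
second_row_fixed = list(fixed_values[1 * length:2 * length])

print(variable_bytes - fixed_bytes)  # bytes spent on the offsets buffer
```

The saving is exactly the size of the offsets buffer (one integer per row, plus one), so it matters most for datasets with many short sequences.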

Regarding reading performance, I expect the fixed-size version to be faster (if you have time for a benchmark, feel free to share the results here).
