Intention of the `length` field in class datasets.Sequence?

As far as I know, `datasets.Sequence` (document) is one of the field types used in a Dataset's features, which specify the serialization format of the dataset.

Just curious, what is the `length` field for? It has a default value of -1, which I guess means arbitrary length. Should I explicitly set `length` if I know all my dataset samples have a fixed length? What is the benefit? Does it improve performance when reading the dataset?


When `length` is specified and is not -1, we store the sequence as a fixed-size PyArrow list, which requires less memory than the variable-length version because the fixed-size list does not store offsets.

As for read performance, I expect the fixed-length version to be faster (if you have time for a benchmark, feel free to share the results here 🙂).