Appropriate yaml for dataset_info list[float]

gabrielaltay · February 23, 2024, 1:01am

Dataset features can be specified as YAML in the README.md (see Create a dataset card). The example below, from squad · Datasets at Hugging Face, includes an answers column, which is a sequence.

dataset_info:
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32

What would be the appropriate YAML for a column that is a list of floats (e.g. an embedding per row)? I'd like to use list instead of sequence. If I were defining it in a dataset build script, it would look like this:

datasets.Features(
    {
        "id": datasets.Value("string"),
        "title": datasets.Value("string"),
        "context": datasets.Value("string"),
        "question": datasets.Value("string"),
        "vecs": [datasets.Value("float16")],
    }
)

This looks strange to me, but it seems to work:

dataset_info:
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: vecs
    list:
      dtype: float16

mariosasko · February 27, 2024, 7:37pm

Hi! You can verify the correct format with the Features._to_yaml_list method (use yaml.safe_dump(features._to_yaml_list()) to get the actual string).

gabrielaltay · February 28, 2024, 6:43am

Thanks! Yes, it looks like yaml.safe_dump(features._to_yaml_list()) gives:

  - name: question
    dtype: string
  - name: vecs
    list: float16

