Appropriate YAML for dataset_info list[float]

We can specify dataset features as YAML in the README.md (Create a dataset card). The example below includes an answers column, which is a sequence (squad · Datasets at Hugging Face).

dataset_info:
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32

What would be the appropriate YAML for a column that is a list of floats (e.g. an embedding per row)? I'd like to use list instead of sequence. If I were defining it in a dataset build script, it would look like this:

import datasets

datasets.Features(
    {
        "id": datasets.Value("string"),
        "title": datasets.Value("string"),
        "context": datasets.Value("string"),
        "question": datasets.Value("string"),
        # a plain Python list around one feature means "list of that feature"
        "vecs": [datasets.Value("float16")],
    }
)

This looks strange to me but seems to work:

dataset_info:
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: vecs
    list:
      dtype: float16

Hi! You can verify the correct format with the Features._to_yaml_list method (use yaml.safe_dump(features._to_yaml_list()) to get the actual string).

Thanks! Yeah, looks like yaml.safe_dump(features._to_yaml_list()) gives:

  - name: question
    dtype: string
  - name: vecs
    list: float16
