Dataset features can be specified as YAML in the README.md (Create a dataset card). The example below includes an answers
column, which is a sequence (squad · Datasets at Hugging Face).
dataset_info:
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
What would be the appropriate YAML for a column that is a list of floats (e.g. one embedding per row)? I’d like to use list instead of sequence. If I were defining it in a dataset build script, it would look like this:
datasets.Features(
    {
        "id": datasets.Value("string"),
        "title": datasets.Value("string"),
        "context": datasets.Value("string"),
        "question": datasets.Value("string"),
        "vecs": [datasets.Value("float16")],
    }
)
This looks strange to me, but it seems to work:
dataset_info:
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: vecs
    list:
      dtype: float16
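
To make the correspondence between the build-script definition and the card YAML explicit, here is a minimal sketch that renders a simplified feature spec with the nesting written out. Everything in it is an assumption for illustration: feature_lines and features_to_yaml are hypothetical helpers, not part of the datasets library, and the spec format (a dtype string, or a one-element list meaning "list of that type") is my own shorthand for datasets.Value(...) and [datasets.Value(...)]. It only covers the flat and list cases discussed here, not sequence or nested structs.

```python
# Hypothetical helpers, NOT part of the `datasets` library: they render a
# simplified feature spec as dataset-card YAML by hand (stdlib only).


def feature_lines(spec, indent):
    """Return the YAML lines describing one feature type.

    `spec` is either a dtype string such as "float16", or a one-element
    list such as ["float16"] meaning "list of that type".
    """
    pad = " " * indent
    if isinstance(spec, str):
        return [f"{pad}dtype: {spec}"]
    if isinstance(spec, list) and len(spec) == 1:
        # A list column nests its element type under a `list:` key.
        return [f"{pad}list:"] + feature_lines(spec[0], indent + 2)
    raise TypeError(f"unsupported feature spec: {spec!r}")


def features_to_yaml(features):
    """Render a {column name: spec} mapping as a dataset_info block."""
    lines = ["dataset_info:", "  features:"]
    for name, spec in features.items():
        lines.append(f"  - name: {name}")
        lines.extend(feature_lines(spec, indent=4))
    return "\n".join(lines)


card_yaml = features_to_yaml(
    {
        "id": "string",
        "title": "string",
        "context": "string",
        "question": "string",
        "vecs": ["float16"],  # list of float16, e.g. one embedding per row
    }
)
print(card_yaml)
```

The point is that the element type sits one level deeper than list:, which is why the dtype line under vecs is indented further than the dtype lines of the plain string columns.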