Hi @lhoestq,
As already discussed, we need datasets>=1.16 in order to push_to_hub() and load_dataset() a DatasetDict(). This is very clear.
However, I checked whether the features of the DatasetDict() are kept, and it appears they are not.
# download https://huggingface.co/datasets/lener_br
from datasets import load_dataset
datasets = load_dataset('lener_br')
# check the features
datasets['train'].features
{'id': Value(dtype='string', id=None),
'ner_tags': Sequence(feature=ClassLabel(num_classes=13, names=['O', 'B-ORGANIZACAO', 'I-ORGANIZACAO', 'B-PESSOA', 'I-PESSOA', 'B-TEMPO', 'I-TEMPO', 'B-LOCAL', 'I-LOCAL', 'B-LEGISLACAO', 'I-LEGISLACAO', 'B-JURISPRUDENCIA', 'I-JURISPRUDENCIA'], names_file=None, id=None), length=-1, id=None),
'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}
# connect to HF hub
from huggingface_hub import notebook_login
notebook_login()
# push this DatasetDict() to my HF profile as private
datasets.push_to_hub(repo_id='test_lener_br', private=True)
# download the pushed DatasetDict() (API_TOKEN holds my HF access token)
datasets = load_dataset('pierreguillou/test_lener_br', use_auth_token=API_TOKEN)
# check the features
datasets['train'].features
{'id': Value(dtype='string', id=None),
'ner_tags': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}
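As you can see, I lost the ClassLabel feature of ner_tags from the original DatasetDict(): after the round trip, ner_tags is a plain Sequence of int64 values and the 13 label names are gone. What do you think?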
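In the meantime, here is a minimal sketch of how I can still recover the label names by hand. It assumes the integer ids in the reloaded ner_tags keep the same order as the names in the original ClassLabel shown above; the helper names (ids_to_names, names_to_ids) and the example ids are my own, not part of the datasets library.

```python
# Label names copied from the original ClassLabel feature of lener_br
# (the order defines the integer id of each label).
NER_LABEL_NAMES = [
    'O',
    'B-ORGANIZACAO', 'I-ORGANIZACAO',
    'B-PESSOA', 'I-PESSOA',
    'B-TEMPO', 'I-TEMPO',
    'B-LOCAL', 'I-LOCAL',
    'B-LEGISLACAO', 'I-LEGISLACAO',
    'B-JURISPRUDENCIA', 'I-JURISPRUDENCIA',
]

# Reverse lookup: label name -> integer id.
NAME_TO_ID = {name: idx for idx, name in enumerate(NER_LABEL_NAMES)}


def ids_to_names(tag_ids):
    """Hypothetical helper: map int64 tag ids back to their label names."""
    return [NER_LABEL_NAMES[i] for i in tag_ids]


def names_to_ids(tag_names):
    """Hypothetical helper: map label names to their integer ids."""
    return [NAME_TO_ID[name] for name in tag_names]


# Made-up ids for illustration, not taken from the dataset.
example_ids = [0, 1, 2, 0]
example_names = ids_to_names(example_ids)
print(example_names)
```

This only restores the names on my side; the reloaded dataset itself still reports Value(dtype='int64') for ner_tags. As far as I understand, casting the reloaded dataset back to the original features should also be possible, but I have not verified that here.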