Getting ValueError while using datasets.interleave_datasets

Hi, I am following the Hugging Face course. In Chapter 5, The Datasets Library, under the topic “Big data? Datasets to the rescue”, I am trying to execute the code below in a Colab notebook, as detailed in the course.

combined_dataset = interleave_datasets([pubmed_dataset_streamed, law_dataset_streamed])

list(islice(combined_dataset, 2))

In response, I get the error below, for which I cannot find a solution anywhere on the web, including GitHub and Stack Overflow. I would appreciate it if someone could look into this and let me know the resolution. Thank you.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-65-42004c060c27> in <module>
      2 from datasets import interleave_datasets
      3 
----> 4 combined_dataset = interleave_datasets([pubmed_dataset_streamed, law_dataset_streamed])
      5 list(islice(combined_dataset, 2))

2 frames
/usr/local/lib/python3.8/dist-packages/datasets/features/features.py in _check_if_features_can_be_aligned(features_list)
   2052         for k, v in features.items():
   2053             if not (isinstance(v, Value) and v.dtype == "null") and name2feature[k] != v:
-> 2054                 raise ValueError(
   2055                     f'The features can\'t be aligned because the key {k} of features {features} has unexpected type - {v} (expected either {name2feature[k]} or Value("null").'
   2056                 )

ValueError: The features can't be aligned because the key meta of features {'meta': {'case_ID': Value(dtype='string', id=None), 'case_jurisdiction': Value(dtype='string', id=None), 'date_created': Value(dtype='string', id=None)}, 'text': Value(dtype='string', id=None)} has unexpected type - {'case_ID': Value(dtype='string', id=None), 'case_jurisdiction': Value(dtype='string', id=None), 'date_created': Value(dtype='string', id=None)} (expected either {'pmid': Value(dtype='int64', id=None), 'language': Value(dtype='string', id=None)} or Value("null").

Hi! This error means that the features of the interleaved datasets are not the same, and they have to be for interleaving. To fix it, make sure pubmed_dataset_streamed.features is equal to law_dataset_streamed.features before the interleave_datasets call.
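To make the mismatch concrete, here is a minimal sketch that compares the two schemas reported in the traceback above. It uses plain dicts with dtype strings instead of real Features objects, and find_feature_conflicts is a hypothetical helper written for this post, not part of the datasets library.

```python
def find_feature_conflicts(features_a, features_b):
    """Return the shared column names whose types differ between two schemas."""
    conflicts = []
    for name in sorted(set(features_a) & set(features_b)):
        if features_a[name] != features_b[name]:
            conflicts.append(name)
    return conflicts


# Schemas copied from the error message, simplified to plain dicts.
pubmed_features = {
    "meta": {"pmid": "int64", "language": "string"},
    "text": "string",
}
law_features = {
    "meta": {
        "case_ID": "string",
        "case_jurisdiction": "string",
        "date_created": "string",
    },
    "text": "string",
}

conflicts = find_feature_conflicts(pubmed_features, law_features)
print(conflicts)  # ['meta']
```

Only the meta column differs, so that is the one column that has to be reconciled or dropped before interleaving.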

Thanks @mariosasko for the suggestion. Wondering why this is not mentioned in the course! Nevertheless, how do I equate the features in both the datasets; can you give me the syntax? Thank you.


You can drop the meta column (not needed for training) to equate them:

combined_dataset = interleave_datasets(
    [
        pubmed_dataset_streamed.remove_columns("meta"),
        law_dataset_streamed.remove_columns("meta")
    ]
)

(Please make sure you are using the latest release of datasets (2.8.0) before running this; check with import datasets; print(datasets.__version__))
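For intuition about why dropping meta is enough, here is a pure-Python sketch of the idea. It is not how datasets implements interleaving: drop_column and round_robin are hypothetical helpers, and the rows are made-up toy data. It alternates between the streams and stops as soon as one runs out, which is my understanding of the default behaviour of interleave_datasets.

```python
from itertools import islice


def drop_column(records, column):
    """Yield each record without the given column."""
    for record in records:
        yield {key: value for key, value in record.items() if key != column}


def round_robin(*streams):
    """Alternate between streams; stop when the first one is exhausted."""
    iterators = [iter(stream) for stream in streams]
    while True:
        for iterator in iterators:
            try:
                yield next(iterator)
            except StopIteration:
                return


# Toy rows shaped like the two datasets (values are invented).
pubmed_rows = [
    {"meta": {"pmid": 1, "language": "eng"}, "text": "toy pubmed abstract"},
]
law_rows = [
    {
        "meta": {"case_ID": "a", "case_jurisdiction": "b", "date_created": "c"},
        "text": "toy court opinion",
    },
]

combined = round_robin(
    drop_column(pubmed_rows, "meta"),
    drop_column(law_rows, "meta"),
)
first_two = list(islice(combined, 2))
print(first_two)
```

Once meta is removed, every record has the same single text column, so the two streams can be mixed freely.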


@mariosasko, I am sorry, I am unable to try your suggestion because I am now encountering a ValueError while loading the dataset with the streaming option itself.

Linking the issue: Getting Value Error while loading a dataset.. · Issue #5388 · huggingface/datasets · GitHub

This does indeed work!

However, I would like to know why law_dataset_streamed.features doesn’t return anything.
I am instantiating the law_dataset_streamed object as

law_dataset_streamed = load_dataset(
    "json",
    data_files="https://the-eye.eu/public/AI/pile_preliminary_components/FreeLaw_Opinions.jsonl.zst",
    split="train",
    streaming=True,
)

If one cannot see the features of an IterableDataset, it is difficult to know which ones to drop.
Curiously, I can see the features of pubmed_dataset_streamed via pubmed_dataset_streamed.features.

datasets.__version__ = 2.8

Thank you

You cannot see the columns of law_dataset_streamed through the .features attribute because law_dataset_streamed is an iterable dataset (a result of the streaming=True parameter in the load_dataset() call), and to explore the elements of an iterable dataset we need to iterate over it. Since the iterator yields one dictionary per example, we can read the column names from the first one:

next(iter(pubmed_dataset_streamed)).keys()

dict_keys(['meta', 'text'])
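The top-level keys are the same for both datasets, so it can help to look one level deeper. Below is a small sketch that works on any iterable of dicts; peek_columns is a hypothetical helper and the row is made-up toy data with the key names from the traceback. On a streamed dataset you would pass the dataset itself, which starts a fresh iteration each time as far as I know.

```python
def peek_columns(stream):
    """Describe the first record: nested key names for dicts, type name otherwise."""
    first = next(iter(stream))
    summary = {}
    for key, value in first.items():
        if isinstance(value, dict):
            summary[key] = sorted(value)
        else:
            summary[key] = type(value).__name__
    return summary


# Toy stand-in for law_dataset_streamed (values are invented).
toy_law_rows = [
    {
        "meta": {"case_ID": "a", "case_jurisdiction": "b", "date_created": "c"},
        "text": "toy court opinion",
    },
]

columns = peek_columns(toy_law_rows)
print(columns)
```

This shows the nested keys inside meta, which is where the two datasets actually differ.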

This helps and solves the problem. However, it looks like zstandard must be installed before importing the datasets library for everything to work correctly.

I can indeed see the features of an IterableDataset, since pubmed_dataset_streamed.features works, so I don’t understand why it shouldn’t work for law_dataset_streamed.

Hi @Shamik! The former dataset has a loading script in which its features can be defined in advance, but the latter doesn’t, since it reads data from arbitrary local/remote files. Assuming all the rows have the same columns, you can call _resolve_features() on it to fix this (otherwise, you can pass features to load_dataset to manually specify the features).
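A rough sketch of the second option, passing features explicitly, might look like the following. The schema is copied from the error message earlier in this thread; LAW_COLUMNS, build_features and load_law_stream are names made up for this example, and the datasets imports sit inside the functions only so the schema can be inspected without the library installed.

```python
# Column types for the law dataset, as reported in the traceback.
LAW_COLUMNS = {
    "meta": {
        "case_ID": "string",
        "case_jurisdiction": "string",
        "date_created": "string",
    },
    "text": "string",
}


def build_features(columns):
    """Turn the nested dict of dtype strings into a datasets.Features object."""
    from datasets import Features, Value  # third-party; imported lazily

    def convert(spec):
        if isinstance(spec, dict):
            return {name: convert(sub) for name, sub in spec.items()}
        return Value(spec)

    return Features(convert(columns))


def load_law_stream(url):
    """Stream the JSON file at `url` with the features declared up front."""
    from datasets import load_dataset  # third-party; imported lazily

    return load_dataset(
        "json",
        data_files=url,
        split="train",
        streaming=True,
        features=build_features(LAW_COLUMNS),
    )
```

Calling load_law_stream with the same data_files URL used above should give a streamed dataset whose .features is populated. Note that this only makes the features visible; the meta column still differs from the PubMed one, so it still has to be removed or reconciled before interleaving.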

Thanks for your reply.
I have created a PR for this: Interleaving Datasets Bug Fix by Shamik-07 · Pull Request #478 · huggingface/course · GitHub