Hi, I am following the Hugging Face course. In Chapter 5 (The Datasets Library), under the section "Big data? Datasets to the rescue", I am trying to execute the code below in a Colab notebook, as detailed in the course.
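The cell I am running is essentially the following (the two streamed datasets are loaded earlier in the same section of the course):

```python
from itertools import islice
from datasets import interleave_datasets

combined_dataset = interleave_datasets([pubmed_dataset_streamed, law_dataset_streamed])
list(islice(combined_dataset, 2))
```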
In response, I am getting the error below, for which I am unable to find a solution anywhere on the web, including GitHub and Stack Overflow. I would appreciate it if someone could look into this and let me know the resolution. Thank you.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-65-42004c060c27> in <module>
2 from datasets import interleave_datasets
3
----> 4 combined_dataset = interleave_datasets([pubmed_dataset_streamed, law_dataset_streamed])
5 list(islice(combined_dataset, 2))
/usr/local/lib/python3.8/dist-packages/datasets/features/features.py in _check_if_features_can_be_aligned(features_list)
2052 for k, v in features.items():
2053 if not (isinstance(v, Value) and v.dtype == "null") and name2feature[k] != v:
-> 2054 raise ValueError(
2055 f'The features can\'t be aligned because the key {k} of features {features} has unexpected type - {v} (expected either {name2feature[k]} or Value("null").'
2056 )
ValueError: The features can't be aligned because the key meta of features {'meta': {'case_ID': Value(dtype='string', id=None), 'case_jurisdiction': Value(dtype='string', id=None), 'date_created': Value(dtype='string', id=None)}, 'text': Value(dtype='string', id=None)} has unexpected type - {'case_ID': Value(dtype='string', id=None), 'case_jurisdiction': Value(dtype='string', id=None), 'date_created': Value(dtype='string', id=None)} (expected either {'pmid': Value(dtype='int64', id=None), 'language': Value(dtype='string', id=None)} or Value("null").
Hi! This error means that the features of the datasets being interleaved are not the same, and they have to be for interleaving to work. To fix it, make sure pubmed_dataset_streamed.features is equal to law_dataset_streamed.features before the interleave_datasets call.
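Since the mismatch comes from the nested meta column (pmid/language in one dataset, case_ID/case_jurisdiction/date_created in the other), one way to align them is to drop meta from both streams and interleave only the shared text column. A minimal sketch, assuming a datasets version where IterableDataset supports remove_columns:

```python
from itertools import islice
from datasets import interleave_datasets

# Drop the mismatched nested "meta" column from both streamed datasets
# so that only the common "text" column remains.
pubmed_text_only = pubmed_dataset_streamed.remove_columns("meta")
law_text_only = law_dataset_streamed.remove_columns("meta")

combined_dataset = interleave_datasets([pubmed_text_only, law_text_only])
print(list(islice(combined_dataset, 2)))
```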
Thanks @mariosasko for the suggestion. I wonder why this is not mentioned in the course! Nevertheless, how do I make the features of the two datasets equal; can you give me the syntax? Thank you.
@mariosasko I am sorry, I am unable to try your suggestion because now I am encountering a ValueError while loading the dataset with the streaming option itself.
If one cannot see the list of features of an IterableDataset, then it's difficult to know which features to drop.
Incidentally, I can see the features of the PubMed dataset via pubmed_dataset_streamed.features.
The reason you cannot see the columns of law_dataset_streamed by calling the .features attribute is that law_dataset_streamed is an IterableDataset object (because of the streaming=True parameter in the load_dataset() call), and to explore the elements of an IterableDataset we need to iterate over it. Since the generator yields a dictionary, we can read the column names off an example that way.
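For instance, a minimal sketch (using the law_dataset_streamed variable loaded earlier in the course):

```python
from itertools import islice

# Take the first example from the streamed dataset; each example is a dict,
# so its keys are the column names.
first_example = next(iter(law_dataset_streamed))
print(first_example.keys())

# Peek at a couple of full examples without downloading the whole dataset.
print(list(islice(law_dataset_streamed, 2)))
```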
This helps and solves the problem. However, it looks like we first have to install zstandard before importing the datasets library for everything to work correctly.
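In a Colab cell that would look something like this (restarting the runtime may be needed if datasets was already imported):

```python
# Install the zstandard compression library before importing datasets,
# so the .jsonl.zst files can be streamed.
!pip install zstandard
```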
I can indeed see the features of an IterableDataset, since I can see them for pubmed_dataset_streamed via pubmed_dataset_streamed.features, so I don't understand why it shouldn't work for law_dataset_streamed.
Hi @Shamik! The former dataset has a loading script in which its features are defined in advance, but the latter doesn't, since it reads data from arbitrary local/remote files. Assuming all the rows have the same columns, you can call resolve_features() on it to fix this (otherwise, you can pass features to load_dataset to specify the features manually).
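For the second option, a sketch of passing the features manually; the nested field names are taken from the error message above, and data_files is assumed to be the same variable used in the course to load the FreeLaw data:

```python
from datasets import load_dataset, Features, Value

# Define the schema explicitly instead of relying on feature inference.
# The nested field names below come from the error message earlier in this thread.
law_features = Features(
    {
        "meta": {
            "case_ID": Value("string"),
            "case_jurisdiction": Value("string"),
            "date_created": Value("string"),
        },
        "text": Value("string"),
    }
)

law_dataset_streamed = load_dataset(
    "json",
    data_files=data_files,  # same data_files URL as in the course section
    features=law_features,
    split="train",
    streaming=True,
)
print(law_dataset_streamed.features)
```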