I am trying to add the Turkic Machine Translation dataset introduced by this paper, which is stored there, so that it can be used easily. I downloaded all the sets and am using this script to pre-process them:
import os
from tqdm.notebook import tqdm
from datasets import Dataset, DatasetDict

# Collect every language-pair directory name across the splits
lang_combs = set()
splits = ['train', 'dev', 'test/bible', 'test/ted', 'test/x-wmt']
for split in splits:
    lang_combs.update(os.listdir(split))
lang_combs = sorted(lang_combs)

datasets = {}
for lang_comb in tqdm(lang_combs):
    dataset = {}
    for split in splits:
        split_data = {"translation": []}
        src_lang, tgt_lang = lang_comb.split('-')
        src_file = os.path.join(split, lang_comb, f"{src_lang}-{tgt_lang}.{src_lang}")
        tgt_file = os.path.join(split, lang_comb, f"{src_lang}-{tgt_lang}.{tgt_lang}")
        if os.path.exists(src_file) and os.path.exists(tgt_file):
            with open(src_file, 'r') as src, open(tgt_file, 'r') as tgt:
                src_sents = [line.strip() for line in src.readlines()]
                tgt_sents = [line.strip() for line in tgt.readlines()]
            for src_sent, tgt_sent in zip(src_sents, tgt_sents):
                split_data["translation"].append({
                    src_lang: src_sent,
                    tgt_lang: tgt_sent
                })
        split = split.replace('/', '-')
        dataset[split] = Dataset.from_dict(split_data)
    dataset_dict = DatasetDict(dataset)
    datasets[lang_comb] = dataset_dict

final_dataset = DatasetDict(datasets)
Then, when I tried to push it to the Hub, I am getting this error:
TypeError: Values in DatasetDict should be of type Dataset but got type '<class 'datasets.dataset_dict.DatasetDict'>'
The authors of the original paper have a Turkic X-WMT dataset, but it is generated from a loading script (turkic_xwmt.py). Also, what is the difference between pushing the data as Parquet files and writing a loading script? As far as I know, a script is suitable for datasets that are accessible via an API. So, how can I solve this?
What you have is a dataset with multiple configurations: each language combination is a different configuration, and I suspect that the upload function still does not support pushing multiple configurations programmatically at the same time.
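If your version of datasets supports the config_name argument to push_to_hub, a workaround might be to push each language pair separately, one configuration per call. A minimal dry-run sketch of that loop (the repo id "user/turkic-mt" is made up, and a stub stands in for the real push so it can run without authentication):

```python
# Dry-run sketch: push each language pair as its own configuration.
# Assumes a datasets version where DatasetDict.push_to_hub accepts
# config_name; "user/turkic-mt" is a placeholder repo id.
pushed = []

def push_config(repo_id, ds_dict, config_name):
    # Real code would be: ds_dict.push_to_hub(repo_id, config_name=config_name)
    pushed.append((repo_id, config_name))

# Stand-ins for the per-pair DatasetDicts built by the script above
datasets = {"az-ba": {"train": None}, "az-en": {"train": None}}

for lang_comb, ds_dict in sorted(datasets.items()):
    push_config("user/turkic-mt", ds_dict, config_name=lang_comb)

print(pushed)
# [('user/turkic-mt', 'az-ba'), ('user/turkic-mt', 'az-en')]
```

With real data the loop body would make one authenticated call per pair, so 405 pairs means 405 pushes, but each value handed to the Hub is then a plain DatasetDict of Datasets, which avoids the TypeError above.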
Thank you for your response, but I don't think this solves the problem. What I believe the problem is: I am trying to push a dataset consisting of 405 DatasetDicts, each of which has train, dev and test Datasets. Do you see the error message?
TypeError: Values in DatasetDict should be of type Dataset but got type '<class 'datasets.dataset_dict.DatasetDict'>'
Other translation datasets were created not by pushing everything to the Hub but by writing a wrapper around the original source. So, how can I push this dataset to the Hub as Parquet files, or write a wrapper around the Google Drive source?
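One way to see the shape problem independently of the Hub: a DatasetDict is just a flat mapping from split name to Dataset, so the 405 pair-level dicts can be merged into one level by folding the pair name into the split name. A stdlib-only sketch of that restructuring (plain lists stand in for Dataset objects; the underscore-joined key scheme is my own convention, chosen because hyphens may not be accepted in split names on the Hub):

```python
def flatten_pairs(nested):
    """Flatten {pair: {split: data}} into one level keyed "pair_split".

    DatasetDict values must be Dataset objects, so a dict of
    DatasetDicts cannot be wrapped in another DatasetDict; merging
    the pair name into the split name gives a single flat mapping
    that a DatasetDict can hold.
    """
    flat = {}
    for pair, splits in nested.items():
        # hyphens may not be valid in split names, so replace them
        for split, data in splits.items():
            flat[f"{pair.replace('-', '_')}_{split}"] = data
    return flat

# Plain lists stand in for the per-split Dataset objects
nested = {
    "az-ba": {"train": ["..."], "dev": ["..."]},
    "az-en": {"train": ["..."]},
}
print(sorted(flatten_pairs(nested)))
# ['az_ba_dev', 'az_ba_train', 'az_en_train']
```

The resulting flat dict could then be wrapped in a single DatasetDict and pushed in one call, at the cost of encoding the language pair in the split name rather than as a proper configuration.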