Problem pushing a dataset to the Hub

I am trying to add the Turkic Machine Translation dataset that is introduced by this paper and stored there, so that it can be used easily. I downloaded all the sets and am using this script to pre-process them:

import os
from tqdm.notebook import tqdm
from datasets import Dataset, DatasetDict

lang_combs = set()
splits = ['train', 'dev', 'test/bible', 'test/ted', 'test/x-wmt']

for split in splits:
    lang_combs.update(os.listdir(split))

lang_combs = sorted(list(lang_combs))

datasets = {}

for lang_comb in tqdm(lang_combs):
    dataset = {}

    for split in splits:
        split_data = {"translation": []}
        src_lang, tgt_lang = lang_comb.split('-')
        src_file = os.path.join(split, lang_comb, f"{src_lang}-{tgt_lang}.{src_lang}")
        tgt_file = os.path.join(split, lang_comb, f"{src_lang}-{tgt_lang}.{tgt_lang}")
        
        if os.path.exists(src_file) and os.path.exists(tgt_file):
            with open(src_file, 'r', encoding='utf-8') as src, open(tgt_file, 'r', encoding='utf-8') as tgt:
                src_sents = [line.strip() for line in src.readlines()]
                tgt_sents = [line.strip() for line in tgt.readlines()]

            for src_sent, tgt_sent in zip(src_sents, tgt_sents):
                split_data["translation"].append({
                    src_lang: src_sent,
                    tgt_lang: tgt_sent
                })

            split_name = split.replace('/', '-')  # e.g. "test/bible" -> "test-bible"
            dataset[split_name] = Dataset.from_dict(split_data)

    dataset_dict = DatasetDict(dataset)
    datasets[lang_comb] = dataset_dict

# note: the values here are DatasetDicts, not Datasets
final_dataset = DatasetDict(datasets)

Then, when I tried to push it to the Hub, I am getting this error:

TypeError: Values in `DatasetDict` should be of type `Dataset` but got type '<class 'datasets.dataset_dict.DatasetDict'>'

The authors of the original paper have a Turkic X-WMT dataset, but it is generated from a script (turkic_xwmt.py). Also, what is the difference between pushing the data as Parquet files and writing a loading script? As far as I know, a script is suitable for datasets that are accessible via an API. So, how can I solve this?

I think this is the problem

What you have is a dataset with multiple configurations: each language combination is a different configuration, and I suspect that the upload function still does not support pushing multiple configurations programmatically at the same time.

Thank you for your response, but I don't think that solves the problem. What I believe is happening is that I am trying to push a dataset consisting of 405 DatasetDicts, each of which has train, dev and test Datasets. See the error message:

TypeError: Values in `DatasetDict` should be of type `Dataset` but got type '<class 'datasets.dataset_dict.DatasetDict'>'

Other translation datasets are created not by pushing all the data to the Hub but by writing a wrapper around the original source. So, how can I push this dataset to the Hub as Parquet files, or write a wrapper around the Google Drive files?

It's the same thing, from what I understand: it doesn't support pushing a dictionary of DatasetDicts.

Anyway, you can look at using HfApi as an alternative.