I am trying to add the Turkic Machine Translation dataset introduced by this paper, which is stored there, so that it can be used easily. I downloaded all the sets and am using this script to pre-process them:
import os
from tqdm.notebook import tqdm
from datasets import Dataset, DatasetDict

# Collect every language-pair directory name across the splits
lang_combs = set()
splits = ['train', 'dev', 'test/bible', 'test/ted', 'test/x-wmt']
for split in splits:
    lang_combs.update(os.listdir(split))
lang_combs = sorted(lang_combs)

datasets = {}
for lang_comb in tqdm(lang_combs):
    dataset = {}
    for split in splits:
        split_data = {"translation": []}
        src_lang, tgt_lang = lang_comb.split('-')
        src_file = os.path.join(split, lang_comb, f"{src_lang}-{tgt_lang}.{src_lang}")
        tgt_file = os.path.join(split, lang_comb, f"{src_lang}-{tgt_lang}.{tgt_lang}")
        if os.path.exists(src_file) and os.path.exists(tgt_file):
            with open(src_file, 'r') as src, open(tgt_file, 'r') as tgt:
                src_sents = [line.strip() for line in src.readlines()]
                tgt_sents = [line.strip() for line in tgt.readlines()]
            for src_sent, tgt_sent in zip(src_sents, tgt_sents):
                split_data["translation"].append({
                    src_lang: src_sent,
                    tgt_lang: tgt_sent
                })
        split = split.replace('/', '-')
        dataset[split] = Dataset.from_dict(split_data)
    dataset_dict = DatasetDict(dataset)
    datasets[lang_comb] = dataset_dict

final_dataset = DatasetDict(datasets)
Then, when I tried to push it to the Hub, I am getting this error:
TypeError: Values in DatasetDict should be of type Dataset but got type '<class 'datasets.dataset_dict.DatasetDict'>'
The authors of the original paper have a Turkic X-WMT dataset, but it is generated from a loading script (turkic_xwmt.py). Also, what is the difference between pushing the data as Parquet files and writing a loading script? As far as I know, a script is suitable for datasets that are accessible via an API. So, how can I solve this?
What you have is a dataset with multiple configurations: each language combination is a different configuration, and I suspect that the upload function still does not support pushing multiple configurations programmatically at the same time.
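If your version of datasets supports the config_name argument to push_to_hub, a workaround might be to push each language pair separately, one configuration per call. A minimal dry-run sketch of that loop (the repo id "user/turkic-mt" is made up, and a stub stands in for the real push so it can run without authentication):

```python
# Dry-run sketch: push each language pair as its own configuration.
# Assumes a datasets version where DatasetDict.push_to_hub accepts
# config_name; "user/turkic-mt" is a placeholder repo id.
pushed = []

def push_config(repo_id, ds_dict, config_name):
    # Real code would be: ds_dict.push_to_hub(repo_id, config_name=config_name)
    pushed.append((repo_id, config_name))

# Stand-ins for the per-pair DatasetDicts built by the script above
datasets = {"az-ba": {"train": None}, "az-en": {"train": None}}

for lang_comb, ds_dict in sorted(datasets.items()):
    push_config("user/turkic-mt", ds_dict, config_name=lang_comb)

print(pushed)
# [('user/turkic-mt', 'az-ba'), ('user/turkic-mt', 'az-en')]
```

With real data the loop body would make one authenticated call per pair, so 405 pairs means 405 pushes, but each value handed to the Hub is then a plain DatasetDict of Datasets, which avoids the TypeError above.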
Thank you for your response, but I don't think this solves the problem. What I believe the problem is: I am trying to push a dataset consisting of 405 DatasetDicts, each of which has train, dev and test Datasets. Do you see the error message?
TypeError: Values in DatasetDict should be of type Dataset but got type '<class 'datasets.dataset_dict.DatasetDict'>'
Other translation datasets were created not by pushing everything to the Hub but by writing a wrapper around the original source. So, how can I push this dataset to the Hub as Parquet files, or write a wrapper around the Google Drive source?
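One way to see the shape problem independently of the Hub: a DatasetDict is just a flat mapping from split name to Dataset, so the 405 pair-level dicts can be merged into one level by folding the pair name into the split name. A stdlib-only sketch of that restructuring (plain lists stand in for Dataset objects; the underscore-joined key scheme is my own convention, chosen because hyphens may not be accepted in split names on the Hub):

```python
def flatten_pairs(nested):
    """Flatten {pair: {split: data}} into one level keyed "pair_split".

    DatasetDict values must be Dataset objects, so a dict of
    DatasetDicts cannot be wrapped in another DatasetDict; merging
    the pair name into the split name gives a single flat mapping
    that a DatasetDict can hold.
    """
    flat = {}
    for pair, splits in nested.items():
        # hyphens may not be valid in split names, so replace them
        for split, data in splits.items():
            flat[f"{pair.replace('-', '_')}_{split}"] = data
    return flat

# Plain lists stand in for the per-split Dataset objects
nested = {
    "az-ba": {"train": ["..."], "dev": ["..."]},
    "az-en": {"train": ["..."]},
}
print(sorted(flatten_pairs(nested)))
# ['az_ba_dev', 'az_ba_train', 'az_en_train']
```

The resulting flat dict could then be wrapped in a single DatasetDict and pushed in one call, at the cost of encoding the language pair in the split name rather than as a proper configuration.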