Creating and uploading a dataset: Hugging Face Hub vs. dataset creation script

Hello,

Our team is in the process of creating (manually for now) a multilingual machine translation dataset for low-resource languages. Currently, we have text files for each language, sourced from different documents. The text files have the same number of lines: for each document we have lang1.txt and lang2.txt, each with n lines, and each line in lang1.txt maps to the corresponding line in lang2.txt. We currently keep these text files in a GitHub repository.

Once we have completed this dataset curation (it will be actively ongoing), we would like to upload it to the Hugging Face Hub. Essentially, we would like it to be similar to GLUE, with different configurations corresponding to the different multilingual datasets.

When I was looking over the documentation, I found several resources: one on uploading to the Hugging Face Hub, and another on creating a dataset loading script.

I’m confused about the roles of these two methods. Also, I don’t know whether just uploading the raw text files we have is enough to work with.

Question: What is the easiest and most efficient way to upload the kind of dataset that I have?

Thank you.

Hi! Yes, you can upload your text files to the Hub. Then, in order to define the configurations and how the examples must be read from the text files, you must also upload a dataset script in the same repository as your data.

If you don’t upload a dataset script, then the default dataset builder for .txt files is used (and it basically concatenates all the text data together).
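Roughly, loading the raw files then behaves like the generic text builder, something like this sketch (file names are placeholders):

from datasets import load_dataset

# every line of every file becomes one row in a single 'text' column
ds = load_dataset('text', data_files={'train': ['lang1.txt', 'lang2.txt']})
print(ds['train'].features)  # {'text': Value(dtype='string', id=None)}
print(ds['train'][0])        # {'text': '<first line of lang1.txt>'}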

If I understand correctly, your dataset is a parallel dataset like flores, so to write your dataset script you can take some inspiration from the flores dataset script. Also feel free to read the documentation you mentioned about how to create a dataset script.

I’m the original poster. I forgot I already had an account and created another account to ask that question.

Thank you so much for your reply.

  1. Do we need a separate dataset script for each configuration, or a single dataset script that manages everything?
  2. Can this repository just be a GitHub repository? Is there a particular structure the repository needs to follow? For example, where the dataset script needs to be placed, what it should be named, where the dataset itself should reside, etc.

Thank you for your example. I will look at it and report back if I encounter any problems.

  1. Do we need a separate dataset script for each configuration, or a single dataset script that manages everything?

Yes, one script can be used to define all the configurations of your dataset. For example, it’s common for parallel datasets to have one configuration per language pair, as in flores.
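For illustration, here is a minimal sketch of a single script declaring one configuration per language pair (class names and language pairs are hypothetical, and the rest of the builder is omitted):

import datasets

_LANGUAGE_PAIRS = ['en-xx', 'en-yy']  # hypothetical language pairs

class MyParallelConfig(datasets.BuilderConfig):
  def __init__(self, language_pair, **kwargs):
    super().__init__(name=language_pair, version=datasets.Version('1.0.0'), **kwargs)
    self.language_pair = language_pair

class MyParallelDataset(datasets.GeneratorBasedBuilder):
  # one configuration per language pair, all defined in this single script
  BUILDER_CONFIGS = [MyParallelConfig(language_pair=pair) for pair in _LANGUAGE_PAIRS]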

  2. Can this repository just be a GitHub repository? Is there a particular structure the repository needs to follow? For example, where the dataset script needs to be placed, what it should be named, where the dataset itself should reside, etc.

It must be a dataset repository on the Hugging Face Hub; see more information about dataset repository creation here. In terms of structure, the only requirement is that the dataset script has the same name as your dataset. You can place your data in a separate directory if you want. For example:

my_dataset/
├── data/
│   ├── lang1.txt
│   └── lang2.txt
├── README.md
└── my_dataset.py
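Once uploaded, the dataset can then be loaded by configuration name, something like this (the repository id and configuration name are placeholders):

from datasets import load_dataset

# 'username/my_dataset' and 'lang1-lang2' stand in for the actual Hub
# repository id and configuration name
ds = load_dataset('username/my_dataset', 'lang1-lang2')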

I hope that helps 🙂


If we host our data and the code to get it up and running on GitHub, will we have to host it in a dataset repository as well? Or is it possible for the GitHub repo to serve as the repository itself?

If I have to maintain two repos, then I guess I need to make sure that both stay correctly synced.

If we host our data and the code to get it up and running on GitHub, will we have to host it in a dataset repository as well? Or is it possible for the GitHub repo to serve as the repository itself?

Yes, I think so. Though it should be possible to use a GitHub Action to keep the repository on the Hugging Face side synced with the one on GitHub.
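For example, such a workflow could run a small sync step with the huggingface_hub client, along these lines (repository id, folder name, and token handling are placeholders, not a prescribed setup; this assumes a recent version of huggingface_hub):

from huggingface_hub import HfApi

api = HfApi()
# push the data folder of the GitHub checkout to the dataset repo on the Hub;
# 'username/my_dataset' and 'huggingface_hub' are placeholders
api.upload_folder(
    folder_path='huggingface_hub',
    repo_id='username/my_dataset',
    repo_type='dataset',
    token='hf_xxx',  # typically injected as a CI secret
)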

By reading and combining multiple sources on creating dataset scripts, I was able to whip up a script.

I have my setup like this:

  • GitHub hosts the files (.txt) in a repo where we have other scripts that automatically parse the manually extracted and annotated data and put it in a folder within the repo called huggingface_hub. The links to these individual files will serve as the URLs.
  • I have set up a dataset repository on the Hugging Face Hub which will host the dataset creation script, dummy data, dataset_infos.json, and README.

However, before I push the script to the Hugging Face Hub and make sure it can download from the URLs and work correctly, I wanted to test it locally. This is my dataset creation script:

#!/usr/bin/env python
import datasets

# Placeholder metadata (the actual values are elided in this post)
_VERSION = datasets.Version('1.0.0')
_DESCRIPTION = 'Rosetta Balcanica: English-West Balkan parallel translation dataset'
_HOMEPAGE = 'https://github.com/ebegoli/rosetta-balcanica'
_CITATION = ''

# Supported West Balkan language codes
supported_wb = ['ma', 'sh']

# Construct the URLs from GitHub. The URLs are a nested dictionary of the format:
# [language_pair][split][language]
_URL_PREFIX = 'https://raw.githubusercontent.com/ebegoli/rosetta-balcanica/main/dataset/huggingface_hub'
_URLs = {}
for lang in supported_wb:
  lang_pair = f'en-{lang}'
  _URLs[lang_pair] = {
      'train': {
        'en': f'{_URL_PREFIX}/{lang_pair}/train_en.txt',
        f'{lang}': f'{_URL_PREFIX}/{lang_pair}/train_{lang}.txt',
      },
      'test': {
        'en': f'{_URL_PREFIX}/{lang_pair}/test_en.txt',
        f'{lang}': f'{_URL_PREFIX}/{lang_pair}/test_{lang}.txt',
      }
  }


class RosettaBalcanicaConfig(datasets.BuilderConfig):
  """BuilderConfig for Rosetta Balcanica
  """

  def __init__(self, wb_lang, **kwargs):
    super(RosettaBalcanicaConfig, self).__init__(
      name=f'en-{wb_lang}',
      description=f'Translation dataset from en to {wb_lang}',
      version=_VERSION,
      **kwargs
    )

    # validate language
    assert wb_lang in supported_wb, (f"Supported West Balkan languages are {supported_wb}, got {wb_lang}")
    self.wb_lang = wb_lang

class RosettaBalcanica(datasets.GeneratorBasedBuilder):
  """Rosetta Balcanica parallel translation dataset."""

  BUILDER_CONFIGS = [
    RosettaBalcanicaConfig(
      wb_lang=wb_lang,
    )
    for wb_lang in supported_wb
  ]

  def _info(self):
    source,target = 'en', self.config.wb_lang
    features = datasets.Features(
      {
        'id': datasets.Value('string'),
        'translation': datasets.features.Translation(languages=(source, target))
      }
    )

    return datasets.DatasetInfo(
      description=_DESCRIPTION,
      features=features,
      supervised_keys=None,
      homepage=_HOMEPAGE,
      citation=_CITATION,
    )

  def _split_generators(self, dl_manager):
      wb_lang = self.config.wb_lang
      lang_pair = f'en-{wb_lang}'

      # NOTE: for local testing the files are read from a local directory
      # rather than downloaded via dl_manager.download_and_extract(_URLs[lang_pair])
      data_dir = f'rosetta_balcanica/{lang_pair}'
      files = {}
      for split in ('train', 'test'):
        # keys must match the argument names of _generate_examples
        files[split] = {
          'en_path': f'{data_dir}/{split}_en.txt',
          'wb_path': f'{data_dir}/{split}_{wb_lang}.txt'
        }

      return [
        datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs=files['train']),
        datasets.SplitGenerator(name=datasets.Split.TEST, gen_kwargs=files['test']),
      ]

  def _generate_examples(self, en_path, wb_path):
    wb_lang = self.config.wb_lang
    with open(en_path, encoding='utf-8') as f1, open(wb_path, encoding='utf-8') as f2:
      for sent_counter, (en_sent, wb_sent) in enumerate(zip(f1, f2)):
        en_sent = en_sent.strip()
        wb_sent = wb_sent.strip()
        result = (
          sent_counter,
          {
            'id': str(sent_counter),
            'translation': {
              'en': en_sent,
              f'{wb_lang}': wb_sent
            }
          }
        )
        yield result

I tried to test this script out in two ways:

  1. I used load_dataset directly and tried to see what results I got. Specifically, I did
from datasets import load_dataset
ds = load_dataset('rosetta_balcanica', 'en-ma')

If I had my dataset folder within the directory where I ran this, I got:

ds
DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 32844
    })
})

I didn’t get the splits I wanted; I just got all the text files concatenated into one. It seems it didn’t even use the script at all. If I didn’t have the dataset directory where I ran this, then I just got a FileNotFoundError.

  2. I used datasets-cli to test the script. This also required the location of the dataset. When I ran
datasets-cli test ../dataset --save_infos --all_configs

I got the following error:

Traceback (most recent call last):
  File "/home/sudarshan/anaconda3/envs/rose/bin/datasets-cli", line 8, in <module>
    sys.exit(main())
  File "/home/sudarshan/anaconda3/envs/rose/lib/python3.7/site-packages/datasets/commands/datasets_cli.py", line 33, in main
    service.run()
  File "/home/sudarshan/anaconda3/envs/rose/lib/python3.7/site-packages/datasets/commands/test.py", line 142, in run
    for j, builder in enumerate(get_builders()):
  File "/home/sudarshan/anaconda3/envs/rose/lib/python3.7/site-packages/datasets/commands/test.py", line 139, in get_builders
    name=name, cache_dir=self._cache_dir, data_dir=self._data_dir, **module.builder_kwargs
TypeError: type object got multiple values for keyword argument 'name'

Again, it didn’t matter where I ran this from, which indicates that this also did not use the dataset creation script rosetta_balcanica.py.

Any help is appreciated in solving these problems!