Building a dataset file for machine translation and adding it to Hugging Face Datasets

AdWeeb · May 22, 2021, 7:48am

Hi all,
I am trying to add a machine translation dataset for Dravidian languages (South India), but I am hitting an error that I cannot resolve at the moment. Any help would be appreciated. The dataset is stored in CSV format if anyone would like to have a look at it: https://drive.google.com/file/d/1MJjjE5ieQ1xygMhqs_VYD0veZlBq794y/view?usp=sharing

The code for the dataset is as follows:

# coding=utf-8
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Machine Translation in Dravidian languages"""

import csv

import datasets

_HOMEPAGE = "https://competitions.codalab.org/competitions/27650"

_CITATION = """\
@inproceedings{chakravarthi-etal-2021-findings-shared,
    title = "Findings of the Shared Task on Machine Translation in {D}ravidian languages",
    author = "Chakravarthi, Bharathi Raja  and
      Priyadharshini, Ruba  and
      Banerjee, Shubhanker  and
      Saldanha, Richard  and
      McCrae, John P.  and
      M, Anand Kumar  and
      Krishnamurthy, Parameswari  and
      Johnson, Melvin",
    booktitle = "Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages",
    month = apr,
    year = "2021",
    address = "Kyiv",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.dravidianlangtech-1.15",
    pages = "119--125",
    abstract = "This paper presents an overview of the shared task on machine translation of Dravidian languages. We presented the shared task results at the EACL 2021 workshop on Speech and Language Technologies for Dravidian Languages. This paper describes the datasets used, the methodology used for the evaluation of participants, and the experiments{'} overall results. As a part of this shared task, we organized four sub-tasks corresponding to machine translation of the following language pairs: English to Tamil, English to Malayalam, English to Telugu and Tamil to Telugu which are available at https://competitions.codalab.org/competitions/27650. We provided the participants with training and development datasets to perform experiments, and the results were evaluated on unseen test data. In total, 46 research groups participated in the shared task and 7 experimental runs were submitted for evaluation. We used BLEU scores for assessment of the translations.",
}
"""

_DESCRIPTION = """\
Machine translation is necessary to improve access to, and the production of, information for monolingual speakers of Dravidian languages. The shared task aims to promote research towards this goal.
"""

_LICENSE = "Creative Commons Attribution 4.0 International Licence"

_URLS = {
    "malayalam-english": {
        "TRAIN_DOWNLOAD_URL": "https://drive.google.com/file/d/11A0kPNhrv9xd1ZwqCCw9HltfnsVvm_o9/view?usp=sharing",
        "TEST_DOWNLOAD_URL": "https://drive.google.com/file/d/1MJjjE5ieQ1xygMhqs_VYD0veZlBq794y/view?usp=sharing",
    },
    "tamil-english": {
        "TRAIN_DOWNLOAD_URL": "https://drive.google.com/file/d/1-2E-bdG_Ad_npW_QFKSUON2LeH1gENsX/view?usp=sharing",
        "TEST_DOWNLOAD_URL": "https://drive.google.com/file/d/1-24hW9E_GiyvAaeN-utR_97ghrFwnn9i/view?usp=sharing",
    },
    "telugu-english": {
        "TRAIN_DOWNLOAD_URL": "https://drive.google.com/file/d/1-7v7HUbT1bGMIs7THc2XnUb0rFK0wRHO/view?usp=sharing",
        "TEST_DOWNLOAD_URL": "https://drive.google.com/file/d/1-2tjohkFrqFj0qnMho22yZiVbTrgZ34_/view?usp=sharing",
    },
    "tamil-telugu": {
        "TRAIN_DOWNLOAD_URL": "https://drive.google.com/file/d/1-AvUG7RAeNPuFOQDIGa-xEL8FbtxNBHZ/view?usp=sharing",
        "TEST_DOWNLOAD_URL": "https://drive.google.com/file/d/1-AIIHxdbSTSru01pvNPiMe3UPldnz033/view?usp=sharing",
    },
}


class DravidianMT(datasets.GeneratorBasedBuilder):
    """ Machine Translation in Dravidian languages dataset"""

    VERSION = datasets.Version("1.0.0")

    BUILDER_CONFIGS = [
        datasets.BuilderConfig(
            name="malayalam-english",
            version=VERSION,
            description="This part of the dataset covers the translation between Malayalam and English",
        ),
        datasets.BuilderConfig(
            name="tamil-english",
            version=VERSION,
            description="This part of the dataset covers the translation between Tamil and English",
        ),
        datasets.BuilderConfig(
            name="telugu-english",
            version=VERSION,
            description="This part of the dataset covers the translation between Telugu and English",
        ),
        datasets.BuilderConfig(
            name="tamil-telugu",
            version=VERSION,
            description="This part of the dataset covers the translation between Tamil and Telugu",
        ),
    ]

    def _info(self):

        if (
                self.config.name == "malayalam-english"
        ):  # This is the name of the configuration selected in BUILDER_CONFIGS above
            features = datasets.Features(
                {
                    "translation": datasets.features.Translation(
                        language_pair=("en", "ml")
                    ),
                }
            )

        elif self.config.name == "tamil-english":
            features = datasets.Features(
                {
                    "translation": datasets.features.Translation(
                        language_pair=("en", "ta")
                    ),
                }
            )
        elif self.config.name == "telugu-english":
            features = datasets.Features(
                {
                    "translation": datasets.features.Translation(
                        language_pair=("en", "te")
                    ),
                }
            )
        # else self.config.name == "tamil-telugu"
        else:
            features = datasets.Features(
                {
                    "translation": datasets.features.Translation(
                        language_pair=("ta", "te")
                    ),
                }
            )

        return datasets.DatasetInfo(
            # This is the description that will appear on the datasets page.
            description=_DESCRIPTION,
            # This defines the different columns of the dataset and their types
            features=features,  # Defined above because the language pair differs across the four configurations
            # If there's a common (input, target) tuple from the features,
            # specify them here. They'll be used if as_supervised=True in
            # builder.as_dataset.
            supervised_keys=None,
            # Homepage of the dataset for documentation
            homepage=_HOMEPAGE,
            # License for the dataset if available
            license=_LICENSE,
            # Citation for the dataset
            citation=_CITATION,
        )

    def _split_generators(self, dl_manager):
        """ Returns SplitGenerators"""

        my_urls = _URLS[self.config.name]

        train_path = dl_manager.download_and_extract(my_urls["TRAIN_DOWNLOAD_URL"])
        test_path = dl_manager.download_and_extract(my_urls["TEST_DOWNLOAD_URL"])

        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "filepath": train_path,
                    "split": "train",
                },
            ),
            datasets.SplitGenerator(
                name=datasets.Split.TEST,
                gen_kwargs={"filepath": test_path, "split": "test"},
            ),
        ]

    def _generate_examples(self, filepath, split):
        """ Generate Dravidian MT examples"""

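        # newline="" lets the csv module handle line endings itself, as the csv documentation recommends
        with open(filepath, encoding="utf-8", newline="") as f: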
            reader = csv.DictReader(f, quoting=csv.QUOTE_NONE)
            for idx, row in enumerate(reader):

                if self.config.name == "tamil-english":
                    result = {"translation": {"en": row["en"], "ta": row["ta"]}}
                    
                elif self.config.name == "malayalam-english":
                    result = {"translation": {"en": row["en"], "ml": row["ml"]}}
                
                elif self.config.name == "telugu-english":
                    result = {"translation": {"en": row["en"], "te": row["te"]}}
                
                else:
                    result = {"translation": {"ta": row["ta"], "te": row["te"]}}

                yield idx, result

lhoestq · May 25, 2021, 9:15am

Hi! This happens because the URLs you provide open the Google Drive page for viewing each file rather than downloading the file itself.

You can fix this by changing the URLs to download URLs:

_URLS = {
    "malayalam-english": {
        "TRAIN_DOWNLOAD_URL": "https://drive.google.com/u/0/uc?id=11A0kPNhrv9xd1ZwqCCw9HltfnsVvm_o9&export=download",
        "TEST_DOWNLOAD_URL": "https://drive.google.com/u/0/uc?id=1MJjjE5ieQ1xygMhqs_VYD0veZlBq794y&export=download",
    },
    "tamil-english": {
        "TRAIN_DOWNLOAD_URL": "https://drive.google.com/u/0/uc?id=1-2E-bdG_Ad_npW_QFKSUON2LeH1gENsX&export=download",
        "TEST_DOWNLOAD_URL": "https://drive.google.com/u/0/uc?id=1-24hW9E_GiyvAaeN-utR_97ghrFwnn9i&export=download",
    },
    "telugu-english": {
        "TRAIN_DOWNLOAD_URL": "https://drive.google.com/u/0/uc?id=1-7v7HUbT1bGMIs7THc2XnUb0rFK0wRHO&export=download",
        "TEST_DOWNLOAD_URL": "https://drive.google.com/u/0/uc?id=1-2tjohkFrqFj0qnMho22yZiVbTrgZ34_&export=download",
    },
    "tamil-telugu": {
        "TRAIN_DOWNLOAD_URL": "https://drive.google.com/u/0/uc?id=1-AvUG7RAeNPuFOQDIGa-xEL8FbtxNBHZ&export=download",
        "TEST_DOWNLOAD_URL": "https://drive.google.com/u/0/uc?id=1-AIIHxdbSTSru01pvNPiMe3UPldnz033&export=download",
    },
}
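
If you have many links to convert, the rewrite can be scripted. The snippet below is only a minimal sketch: `to_download_url` and `_DRIVE_VIEW_PATTERN` are hypothetical names of my own, not part of the `datasets` library, and it assumes every link has the `https://drive.google.com/file/d/<id>/view` shape used in the original post. It produces the same download URL form as the dictionary above.

```python
import re

# Hypothetical helper, not part of the datasets library.
# Assumes links shaped like https://drive.google.com/file/d/<id>/view?usp=sharing
_DRIVE_VIEW_PATTERN = re.compile(r"https://drive\.google\.com/file/d/([^/]+)/")


def to_download_url(view_url: str) -> str:
    """Turn a Google Drive "view" link into the download form shown above."""
    match = _DRIVE_VIEW_PATTERN.match(view_url)
    if match is None:
        raise ValueError(f"Not a Google Drive file link: {view_url}")
    file_id = match.group(1)
    return f"https://drive.google.com/u/0/uc?id={file_id}&export=download"


view_url = "https://drive.google.com/file/d/11A0kPNhrv9xd1ZwqCCw9HltfnsVvm_o9/view?usp=sharing"
print(to_download_url(view_url))
```

Raising `ValueError` on an unrecognised link is a design choice so that a mistyped URL fails early instead of silently downloading an HTML page.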
