Need to read a subset of data files in WMT14

I need to load only a subset of the dataset files in WMT14, along with the TEST and VALIDATION sets.
How can I do this? I tried using split and data_files, but could not get it to work. Kindly help.

Hi ! I think you can just create your own wmt_14_gigafren_only dataset script.

To do so you can download the python files from wmt_14 here: datasets/datasets/wmt14 at master · huggingface/datasets · GitHub. Then rename to and edit this file to only keep gigafren in the _subsets part of the code.

Finally you just need to do load_dataset("path/to/") and you’re done :slight_smile:
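To make the `_subsets` edit more concrete, here is a rough sketch of what trimming the training subsets amounts to. It does not use the `datasets` library: `FULL_SUBSETS` and `keep_train_subsets` are hypothetical names made up for illustration, the split keys are plain strings rather than `datasets.Split` values, and only a few of the training subsets are listed.

```python
# Hypothetical stand-in for the dict returned by `_subsets` in the wmt14 script.
# Split names are plain strings here so the snippet runs without `datasets`.
FULL_SUBSETS = {
    "train": ["commoncrawl", "europarl_v7", "gigafren", "newscommentary_v9"],
    "validation": ["newsdev2014", "newstest2013"],
    "test": ["newstest2014"],
}


def keep_train_subsets(subsets, wanted):
    """Return a copy of `subsets` whose train list only keeps names in `wanted`."""
    unknown = set(wanted) - set(subsets["train"])
    if unknown:
        raise ValueError(f"Unknown training subsets: {sorted(unknown)}")
    # Copy every split so the original dict is left untouched.
    trimmed = {split: list(names) for split, names in subsets.items()}
    trimmed["train"] = [name for name in subsets["train"] if name in wanted]
    return trimmed


gigafren_only = keep_train_subsets(FULL_SUBSETS, {"gigafren"})
print(gigafren_only["train"])  # ['gigafren']
```

In the real script the same effect comes from deleting or commenting out the unwanted names in the TRAIN list, while leaving the VALIDATION and TEST lists as they are.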

Thanks @lhoestq, but this did not work. It asks me to specify the config ‘wmt14’ and a language pair. Once I specify them, it ignores my script and downloads the complete dataset.

Hi! Can you share your script? I’d be happy to help make it work.

Thanks for helping

import datasets

from .wmt_utils import Wmt, WmtConfig

_URL = ""
_CITATION = """
@InProceedings{bojar-EtAl:2014:W14-33,
  author    = {Bojar, Ondrej  and  Buck, Christian  and  Federmann, Christian  and  Haddow, Barry  and  Koehn, Philipp  and  Leveling, Johannes  and  Monz, Christof  and  Pecina, Pavel  and  Post, Matt  and  Saint-Amand, Herve  and  Soricut, Radu  and  Specia, Lucia  and  Tamchyna, Ale\v{s}},
  title     = {Findings of the 2014 Workshop on Statistical Machine Translation},
  booktitle = {Proceedings of the Ninth Workshop on Statistical Machine Translation},
  month     = {June},
  year      = {2014},
  address   = {Baltimore, Maryland, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {12--58},
  url       = {}
}
"""

_LANGUAGE_PAIRS = [(lang, "en") for lang in ["cs", "de", "fr", "hi", "ru"]]


class Wmt14(Wmt):
    """WMT 14 translation datasets for all {xx, "en"} language pairs."""

    # Version history:
    # 1.0.0: S3 (new shuffling, sharding and slicing mechanism).
    BUILDER_CONFIGS = [
        WmtConfig(  # pylint:disable=g-complex-comprehension
            description="WMT 2014 %s-%s translation task dataset." % (l1, l2),
            url=_URL,
            citation=_CITATION,
            language_pair=(l1, l2),
            version=datasets.Version("1.0.0"),
        )
        for l1, l2 in _LANGUAGE_PAIRS
    ]

    @property
    def manual_download_instructions(self):
        if self.config.language_pair[1] in ["cs", "hi", "ru"]:
            return "Please download the data manually as explained. TODO(PVP)"
        return None

    @property
    def _subsets(self):
        return {
            datasets.Split.TRAIN: [
                # "commoncrawl",
                "europarl_v7",  # the only training subset left enabled
                # "multiun",
                # "newscommentary_v9",
                # "gigafren",
                # "czeng_10",
                # "yandexcorpus",
                # "wikiheadlines_hi",
                # "wikiheadlines_ru",
                # "hindencorp_01",
            ],
            datasets.Split.VALIDATION: ["newsdev2014", "newstest2013"],
            datasets.Split.TEST: ["newstest2014"],
        }

loading as


also tried


Your script works fine on my side, it only downloads these subsets: europarl_v7, newstest2013, newstest2014. I have a directory wmt14_gigafren that contains two files: and and I do

load_dataset("wmt14_gigafren", "fr-en")

Can you make sure you have the same directory and try again ?
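A quick way to check the directory layout before calling load_dataset is sketched below. The file names are assumptions, since they are not spelled out above: the sketch expects the loading script to share the directory's name (the convention `datasets` has used for local script directories) and expects `wmt_utils.py` next to it, because the script imports it relatively. `missing_script_files` is a hypothetical helper, not part of the library.

```python
from pathlib import Path


def missing_script_files(directory):
    """List the files a local WMT14 script directory still lacks.

    Assumed layout: <dir>/<dir>.py plus <dir>/wmt_utils.py.
    """
    root = Path(directory).resolve()
    expected = [f"{root.name}.py", "wmt_utils.py"]
    return [name for name in expected if not (root / name).is_file()]


# Run from the parent of the script directory; an empty list means nothing is missing.
print(missing_script_files("wmt14_gigafren"))
```

If the list comes back empty, running `load_dataset("wmt14_gigafren", "fr-en")` from that same parent directory should pick up the local script rather than the full wmt14 dataset. Whether script-based loading is still supported, and whether it needs extra arguments, depends on the installed `datasets` version.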

Thanks, I got it now. It's working. Thanks for the help.