Loader for dataset with multiple source files in one split

Hi, I have a CoNLL-format dataset where the training “split” comprises multiple files. I’m writing a loader based on the conll2003 loader but can’t figure out the best way to load multiple files for a single split. SplitGenerator seems to pass just a single filepath through its gen_kwargs param. What’s the best way to tackle this, or is there a good example to look at?

(The dataset comes with multiple sections, A, B, E, F, G and H, which I’d like to be surfaceable through the loader, but the main thing I’m struggling with is the best way to combine the instances from four of these files to form the training partition.)

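Hi! You can just pass a list of files in gen_kwargs and iterate over the files in _generate_examples :) The entries of gen_kwargs are forwarded as keyword arguments to _generate_examples, so a value can be a list of paths just as well as a single path.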
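For illustration, here is a minimal sketch of that idea, written as a standalone function so it runs without the datasets library. It mirrors what the body of `_generate_examples(self, filepaths)` could look like. The parameter name `filepaths`, the feature names `tokens` and `ner_tags`, the two-column file layout, and the demo file names are all assumptions made for the example, not something taken from your dataset.

```python
import os
import tempfile


def generate_examples(filepaths):
    """Yield (key, example) pairs from several CoNLL-style files.

    In a real loader this would be the body of
    `_generate_examples(self, filepaths)`, with the list arriving via e.g.
      datasets.SplitGenerator(
          name=datasets.Split.TRAIN,
          gen_kwargs={"filepaths": [path_a, path_b]},  # hypothetical paths
      )
    """
    guid = 0  # one running key across all files, so keys stay unique
    for filepath in filepaths:
        tokens, tags = [], []
        with open(filepath, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("-DOCSTART-"):
                    # A blank line or document marker ends the current sentence.
                    if tokens:
                        yield guid, {"tokens": tokens, "ner_tags": tags}
                        guid += 1
                        tokens, tags = [], []
                    continue
                fields = line.split()
                tokens.append(fields[0])   # assumed: token in the first column
                tags.append(fields[-1])    # assumed: tag in the last column
        # Flush the last sentence if the file has no trailing blank line.
        if tokens:
            yield guid, {"tokens": tokens, "ner_tags": tags}
            guid += 1


# Demo with two tiny made-up files standing in for two sections.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    demo_files = [
        ("A.conll", "EU B-ORG\nrejects O\n\nGerman B-MISC\n"),
        ("B.conll", "Peter B-PER\n"),
    ]
    for name, text in demo_files:
        path = os.path.join(tmp, name)
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        paths.append(path)
    examples = list(generate_examples(paths))
```

The key counter lives outside the per-file loop on purpose: if it restarted at 0 for every file, examples from different files would share keys. To expose the individual sections as well, one option would be a builder config per section that passes a one-element list through the same code path.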
