How to create a new large Dataset on disk?

I have my own library which processes corpora of documents much larger than what fits into memory.

I would like to create a new HF Dataset by incrementally adding/streaming new examples to it, but because of the number of examples the dataset would not fit into memory.

I am not sure whether add_item should be used for this: it does not seem to modify the dataset in place but instead returns a new dataset. That makes it look like datasets are actually immutable and add_item creates a new dataset each time?

What is the proper way to do this and how would one go about doing this in parallel (have several processes convert documents to examples in the dataset in parallel)?

Hi ! Right now to create a dataset from a generator you need to use a dataset script. You can find some documentation here: Create a dataset loading script

In your case what matters is how to implement _generate_examples: in this method you can get the new examples to add and yield them. During generation, the dataset is written to disk without filling up your memory (a buffer is flushed every N examples).

You can also implement parallelism directly in _generate_examples if you want.

Thanks - this looks really complicated just to create a dataset!
Moreover, if I understand the docs correctly, the _generate_examples method is meant to process some file whose filepath is configured somewhere else first.

So I do not understand how all this could be used to just perform these basic steps: configure and create a new Arrow dataset, then write examples to that dataset until we run out (assuming I am getting the data needed to create the examples from some stream myself, so I do not know in advance how many I will get).

Basically what I would need is the proper equivalent of with open(dataset_path, "w") as fp: fp.write(example) for Arrow datasets.

I guess the API must be able to do this somehow, as it is one of the many steps done in the process you describe.

I see ! Then you can use the ArrowWriter :slight_smile:


In [1]: from datasets.arrow_writer import ArrowWriter

In [2]: with ArrowWriter(path="tmp.arrow") as writer:
   ...:     writer.write({"a": 1})
   ...:     writer.write({"a": 2})
   ...:     writer.write({"a": 3})
   ...:     writer.finalize()
   ...:

In [3]: from datasets import Dataset

In [4]: ds = Dataset.from_file("tmp.arrow")

In [5]: ds[:]
Out[5]: {'a': [1, 2, 3]}

You can even write several examples at once using

batch = {"a": [4, 5, 6]}
writer.write_batch(batch)

Thanks a lot! Can I ask a follow-up question?

I originally thought I could do this like so:

from datasets import Dataset, Features, ClassLabel, Sequence, Value
# define the features
features = Features(dict(
    id=Value(dtype="string"),
    tokens=Sequence(feature=Value(dtype="string")),
    ner_tags=Sequence(feature=ClassLabel(names=['O', 'B-corporation', 'I-corporation', 'B-creative-work', 'I-creative-work', 'B-group', 'I-group', 'B-location', 'I-location', 'B-person', 'I-person', 'B-product', 'I-product'])),
))
# empty: no instances
empty = dict(
    id = [],
    tokens = [],
    ner_tags = [],
)
ds = Dataset.from_dict(empty, features=features)
ds.save_to_disk(dataset_path="debug_ds1")
ds = Dataset.load_from_disk(dataset_path="debug_ds1")
# NOTE: now would like to add examples to ds, so they get written to disk, but that does not 
# seem to work efficiently, as ds looks to be immutable? 
# At least ds.add_item(..) does not modify ds but seems to create a new DS???

So I realized that add_item(..) returns a new instance, and so does not seem to be the right way to create or add a lot of content. Is that right? I am not even sure what this method actually does: it definitely does not modify the dataset, but is the new dataset it creates always in memory? If not, where is the backing Arrow file?

In any case, based on your suggestion I am now creating an empty dataset using the code above and then I add the examples like this, and it seems to work:

from datasets.arrow_writer import ArrowWriter                                                                                                                                                   

with ArrowWriter(path="debug_ds1/dataset.arrow") as writer:
    writer.write(dict(id="0", tokens=["a", "b"], ner_tags=[0, 0]))
    writer.write(dict(id="1", tokens=["x", "y"], ner_tags=[1, 0]))
    writer.finalize()
    
ds2 = Dataset.load_from_disk(dataset_path="debug_ds1")
len(ds2)

Is this sufficient to create a proper Dataset, or could there be problems because writing directly to the Arrow file might not update some of the fields in the JSON files stored alongside it?

This is why I originally thought writing should go over the Dataset instance somehow …

Somehow this approach looks “hacky” to me - I would have expected a cleaner API to achieve just that: define the metadata (which features and how they are defined) for an on-disk dataset, then write the instances to the dataset, being sure that all internal metadata is correctly set. By “dataset” I simply mean something that can be opened from local disk using Dataset.load_from_disk(dataset_path="debug_ds1") but has the data memory mapped.

A Dataset object is a wrapper of an immutable Arrow table. When an Arrow table is loaded from your disk, you can’t add new items to it. This is because it’s not possible to append new rows to an already existing Arrow table on your disk.

When you do ds.add_item(), it actually creates an in-memory table containing your new element, and the resulting dataset contains the concatenation of the Arrow table on disk, and your new item in memory.

We’re exploring the addition of a new method Dataset.from_generator that would take a python generator function, write the examples on disk, and then return the memory-mapped dataset. Would that be useful in your case ?

Yes, something like Dataset.from_generator(Iterable[Dict], features=Features, ...) would be extremely useful!
Want! Want! Want! Want!
Background: we want to use HF as a DL backend to our gatenlp NLP framework, which means we need to convert a stream/corpus of documents to a HF Dataset, where that Dataset could be so big that it does not fit into memory. So currently, the only approach I see (if I understand things correctly) is to first export “manually” as e.g. JSON and then do Dataset.from_json(..), and maybe cache the data and/or save as an Arrow table backed dataset if it will get used many times?


Cool ! You know this already but for future reference, we started discussing it here: how to convert a dict generator into a huggingface dataset. · Issue #4417 · huggingface/datasets · GitHub

Background: we want to use HF as a DL backend to our gatenlp NLP framework, which means we need to convert a stream/corpus of documents to a HF Dataset, where that Dataset could be so big that it does not fit into memory. So currently, the only approach I see (if I understand things correctly) is to first export “manually” as e.g. JSON and then do Dataset.from_json(..), and maybe cache the data and/or save as an Arrow table backed dataset if it will get used many times?

Sounds good yes, or use the ArrowWriter to write an Arrow table on disk from your generator, and then load the Arrow table with Dataset.from_file


I assume that would be faster than going via JSON? But it would also first create a file (an Arrow table in this case) and then actually create another file from that first one, basically duplicating the data?

I guess the Dataset.from_generator( Iterable[Dict], features=Features, ...) method would avoid that intermediate creation of a temporary file and directly create the dataset I need?

Also, would it not be better to call this Dataset.from_iterable, because there is no reason to restrict the parameter to generators (every generator is an iterable, but not every iterable is a generator)? And if you actually do allow iterables, then from_generator would be a misleading name.

The plan is mostly to ask for a generator function. It’s probably the easiest way to ask for possibly large data without filling up your RAM. TensorFlow Datasets already implements from_generator this way.

Then from_list is also a good one for in-memory data.