Thanks a lot! Can I ask a follow-up question?
I originally thought I could do this like so:
from datasets import Dataset, Features, ClassLabel, Sequence, Value

# define the features
features = Features(dict(
    id=Value(dtype="string"),
    tokens=Sequence(feature=Value(dtype="string")),
    ner_tags=Sequence(feature=ClassLabel(names=[
        'O',
        'B-corporation', 'I-corporation',
        'B-creative-work', 'I-creative-work',
        'B-group', 'I-group',
        'B-location', 'I-location',
        'B-person', 'I-person',
        'B-product', 'I-product',
    ])),
))
# empty: no instances
empty = dict(
    id=[],
    tokens=[],
    ner_tags=[],
)
ds = Dataset.from_dict(empty, features=features)
ds.save_to_disk(dataset_path="debug_ds1")
ds = Dataset.load_from_disk(dataset_path="debug_ds1")
# NOTE: I would now like to add examples to ds so that they get written to disk,
# but that does not seem to work efficiently, since ds appears to be immutable.
# At least ds.add_item(..) does not modify ds but seems to create a new Dataset.
So I realized that add_item(..) returns a new instance, so it does not seem to be the right way to create content or to add a lot of content. Is that right? I am not even sure what this method actually does: it definitely does not modify the dataset, but is the new dataset it creates always in memory? If not, where is the backing Arrow file?
In any case, based on your suggestion I am now creating an empty dataset using the code above and then I add the examples like this, and it seems to work:
from datasets.arrow_writer import ArrowWriter

with ArrowWriter(path="debug_ds1/dataset.arrow") as writer:
    # id is declared as a string feature above, so the ids are written as strings
    writer.write(dict(id="0", tokens=["a", "b"], ner_tags=[0, 0]))
    writer.write(dict(id="1", tokens=["x", "y"], ner_tags=[1, 0]))
    writer.finalize()

ds2 = Dataset.load_from_disk(dataset_path="debug_ds1")
len(ds2)
Is this sufficient to create a proper Dataset, or could there be problems because writing directly to the Arrow file may not update some of the fields in the JSON files stored alongside it?
This is why I originally thought writing should go through the Dataset instance somehow …