This thread was the only one I could find on this topic, so I wanted to extend it with some general guidance on using ‘novel images’. Adding a novel image to a dataset typically means we then want to do something with it.
There are a couple of use cases here:
- we want to add some new images to a dataset and then do a prediction for that image/those images
- we just need to do a prediction for a novel image - we don’t need to add it to an existing dataset.
1. Add image(s) to an existing dataset:
from datasets import Image  # the datasets Image feature, not PIL.Image

newImage = "someNewImage.jpeg"
feature = Image(decode=False)
imageToAdd = {'image': feature.encode_example(newImage),
              'label': 0}
# add_item returns a new Dataset, so reassign the split
localDataset['test'] = localDataset['test'].add_item(imageToAdd)
# might want to add any number of new images
localDataset["test"].set_transform(val_transforms)
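If you have several new images to add, the same pattern extends to a small loop. A sketch, where the `addImages` helper and the `paths` list are hypothetical, while `add_item` and `encode_example` are the `datasets` calls used above:

```python
def addImages(dataset, feature, paths, label=0):
    """Append several image files to a split.

    `dataset` is a datasets.Dataset split and `feature` is an
    Image(decode=False) feature, as above. add_item returns a *new*
    Dataset each time, so we rebind on every iteration.
    """
    for path in paths:
        dataset = dataset.add_item({'image': feature.encode_example(path),
                                    'label': label})
    return dataset
```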
Note that we assume the existence of a val_transforms function here - maybe it would look something like this:
from torchvision.transforms import Compose, Resize, CenterCrop, ToTensor

# size and normalize would typically come from your model's image processor
_val_transforms = Compose(
    [
        Resize(size),
        CenterCrop(size),
        ToTensor(),
        normalize,
    ]
)

def val_transforms(examples):
    examples['pixel_values'] = [_val_transforms(image.convert("RGB")) for image in examples['image']]
    return examples
Now predict - let’s just look at the last image we added:
newNewThing = localDataset["test"][-1]  # a dict with 'image', 'label' and 'pixel_values'
batchedNewThing = np.expand_dims(newNewThing['pixel_values'], axis=0)
mostLikelyIndices = trainedModel.predict(batchedNewThing)
print(id2label[np.argmax(mostLikelyIndices.predictions)])
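The final argmax step is plain numpy; with a (1, num_labels) logits array and an id2label mapping like the one above (both illustrative values here), it looks like this:

```python
import numpy as np

id2label = {0: 'cat', 1: 'dog'}   # illustrative label mapping
logits = np.array([[0.2, 2.7]])   # fake (1, num_labels) model output
# np.argmax flattens by default, which is fine for a batch of one
print(id2label[np.argmax(logits)])  # -> dog
```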
2. Acquire a new image and do a prediction for that image:
Here we need to modify our transform function so that it works with a single entry.
def singleTransform(example):
    # note that _val_transforms already works with single entries
    example['pixel_values'] = _val_transforms(example["image"].convert("RGB"))
    return example
Now, open the image, transform and predict:
newPILImage = PIL.Image.open(newImage)  # a PIL image, not raw bytes
imageToPredict = {'image': newPILImage,
                  'label': 0}
oneNewThing = singleTransform(imageToPredict)
mostLikelyIndices = trainedModel.predict(np.expand_dims(oneNewThing['pixel_values'], axis=0))
print(id2label[np.argmax(mostLikelyIndices.predictions)])
A couple of general notes:
- models expect batches, so when you want to pass a single entry into a model for prediction, you have to make it look like a ‘batch of one’, which is what np.expand_dims does for you.
- when preprocessing is applied to model inputs at train and test time, you should assume the same preprocessing is needed for any novel item you want to predict for.
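To make the ‘batch of one’ point concrete, here is what np.expand_dims does to a typical (channels, height, width) image array (shapes here are just illustrative):

```python
import numpy as np

single = np.zeros((3, 224, 224))        # one preprocessed image, (C, H, W)
batch = np.expand_dims(single, axis=0)  # now (1, 3, 224, 224): a batch of one
print(batch.shape)  # -> (1, 3, 224, 224)
```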