How to extract Images from Arrow datasets

DimOnHF · December 27, 2024, 10:03am

Hello all,

I’m new to working with the Arrow format, and I can’t find any guidance on how to extract the image data from an Arrow dataset I downloaded from Hugging Face.

I have the Arrow file locally, along with the JSON file, but can’t find a way to extract the data from it.

from datasets import Dataset
ds = Dataset.from_file("path/to/data.arrow")

→ This doesn’t work; it raises “TypeError: expected bytes, WindowsPath found”.

Alanturner2 · December 27, 2024, 10:16am

Here’s how you can do it:

from datasets import Dataset

# Ensure the file path is a string
file_path = "path/to/data.arrow"  # Replace with your actual file path

# Load the dataset
ds = Dataset.from_file(file_path)

If you’re using Python’s pathlib to handle file paths, convert the Path object to a string before passing it to the from_file method:

from datasets import Dataset
from pathlib import Path

# Define the file path using pathlib
file_path = Path("path/to/data.arrow")  # Replace with your actual file path

# Convert the Path object to a string
ds = Dataset.from_file(str(file_path))
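
If paths arrive from several places, some as strings and some as Path objects, a small helper can normalize them before the call. This is only a sketch: `as_path_str` is a hypothetical name, not part of the datasets library, and it relies solely on the standard library’s `os.fspath`.

```python
import os
from pathlib import Path


def as_path_str(path):
    """Return `path` as a plain str, accepting a str or any os.PathLike object."""
    # Path() accepts both str and PathLike; os.fspath() turns it back into a str.
    return os.fspath(Path(path))


# Hypothetical usage (the file path is a placeholder):
#   from datasets import Dataset
#   ds = Dataset.from_file(as_path_str(Path("path/to/data.arrow")))
```

The result is always a str, so the same call works whether the caller passed a string or a pathlib object.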

Regarding the image data extraction, once you’ve successfully loaded the dataset, you can access the image data assuming the dataset contains an image column. The datasets library provides an Image feature to handle image data. If your dataset includes file paths to images, you can cast the relevant column to the Image feature to facilitate image processing:

from datasets import Dataset, Image

# Load the dataset
ds = Dataset.from_file("path/to/data.arrow")

# Cast the image column to the Image feature
ds = ds.cast_column("image_column_name", Image())  # Replace 'image_column_name' with your actual column name

# Access an image
image = ds[0]["image_column_name"]  # Use the same column name as in cast_column above

This approach will decode the image file into a PIL image object, allowing for further manipulation or analysis. For more detailed information on processing image data with the datasets library, refer to the official documentation.
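
To write the images out as files rather than work with them in memory, one option is to skip decoding and copy the stored bytes directly. The sketch below assumes the column has been cast with `Image(decode=False)`, in which case each cell is a dict with `"bytes"` and `"path"` keys. The function name `save_image_cells`, the output folder, and the `.img` fallback extension are all hypothetical choices for illustration.

```python
from pathlib import Path


def save_image_cells(rows, column, out_dir):
    """Write each row's raw image bytes into out_dir and return the written paths.

    Assumes row[column] is a dict like {"bytes": b"...", "path": "..."},
    which is what an undecoded Image column yields.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for i, row in enumerate(rows):
        cell = row[column]
        data = cell.get("bytes")
        src = cell.get("path")
        if data is None and src:
            # Only a file path was stored, so read the bytes from disk.
            data = Path(src).read_bytes()
        if data is None:
            continue  # Nothing to extract for this row.
        # Reuse the original extension when known; ".img" is a placeholder.
        suffix = Path(src).suffix if src else ""
        target = out_dir / f"{i:06d}{suffix or '.img'}"
        target.write_bytes(data)
        written.append(target)
    return written


# Hypothetical usage with a loaded dataset (names are placeholders):
#   from datasets import Image
#   ds = ds.cast_column("image_column_name", Image(decode=False))
#   save_image_cells(ds, "image_column_name", "extracted_images")
```

Copying the bytes as stored avoids re-encoding the images. If you would rather save decoded PIL images, keep the default `Image()` cast and call each image’s `save` method instead.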

DimOnHF · December 27, 2024, 10:30am

Thank you very much @Alanturner2 !!!

Indeed, the issue came from using Path without converting it to a string!

Alanturner2 · December 27, 2024, 10:41am

I’m glad that helped. I wish you success with your project!
