How to extract Images from Arrow datasets

Hello all,

I’m new to working with the Arrow format, and I can’t find any guidance on how to extract the image data from an Arrow dataset I downloaded from HF.

I have the Arrow file locally, along with the JSON file, but I can’t find a way to extract the data from it.

from datasets import Dataset
ds = Dataset.from_file("path/to/data.arrow")

→ This doesn’t work; it fails with "TypeError: expected bytes, WindowsPath found".

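The error means from_file received a Path object (here a WindowsPath) where it expects a plain string. Passing the path as a string fixes it: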

from datasets import Dataset

# Ensure the file path is a string
file_path = "path/to/data.arrow"  # Replace with your actual file path

# Load the dataset
ds = Dataset.from_file(file_path)

If you’re using Python’s pathlib to handle file paths, convert the Path object to a string before passing it to the from_file method:

from datasets import Dataset
from pathlib import Path

# Define the file path using pathlib
file_path = Path("path/to/data.arrow")  # Replace with your actual file path

# Convert the Path object to a string
ds = Dataset.from_file(str(file_path))
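
If paths reach your loading code from several places, a small helper can do the conversion in one spot. This is only a sketch: `as_str_path` is a hypothetical name of mine, not part of the datasets library, and it relies on the standard library's `os.fspath`, which returns a `str` for both `str` and `pathlib.Path` inputs.

```python
import os
from pathlib import Path


def as_str_path(path):
    """Return `path` as a plain string, whether it arrived as a str or a Path."""
    text = os.fspath(path)  # Path -> str; a str is returned unchanged
    if not isinstance(text, str):
        # os.fspath passes bytes through untouched, so reject them explicitly
        raise TypeError(f"expected str or Path, got {type(path).__name__}")
    return text


file_path = as_str_path(Path("path/to/data.arrow"))  # placeholder path
```

You could then call `Dataset.from_file(file_path)` exactly as in the snippet above.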

As for extracting the images: once the dataset is loaded, you can read them from its image column, assuming it has one. The datasets library provides an Image feature for handling image data. If the column stores file paths to images, cast it to the Image feature so that accessing a row decodes the file:

from datasets import Dataset, Image

# Load the dataset
ds = Dataset.from_file("path/to/data.arrow")

# Cast the image column to the Image feature
ds = ds.cast_column("image_column_name", Image())  # Replace 'image_column_name' with your actual column name

# Access an image
image = ds[0]["image_column_name"]  # Use the same column name as in cast_column

This approach will decode the image file into a PIL image object, allowing for further manipulation or analysis. For more detailed information on processing image data with the datasets library, refer to the official documentation.
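
To actually write the images out as files, you can loop over the rows and save each decoded image. A minimal sketch, assuming the column already yields PIL images (for example after the cast above) and that their modes can be written as PNG. `export_images`, the output folder, and the zero-padded file names are my own hypothetical choices, not something the datasets library provides.

```python
from pathlib import Path


def export_images(ds, column, out_dir):
    """Save every decoded image in `column` as a PNG and return the written paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    written = []
    for i, row in enumerate(ds):
        img = row[column]  # expected to be a PIL image (anything with .save)
        target = out_dir / f"{i:06d}.png"  # 000000.png, 000001.png, ...
        img.save(target)
        written.append(target)
    return written
```

With the dataset loaded, `export_images(ds, "image_column_name", "extracted_images")` would write one PNG per row into that folder. The function only calls `.save()` on each value, which is the method PIL images provide.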

Thank you very much @Alanturner2 !!!

Indeed, the issue came from using Path without converting it to a string!
