Azure ML Datastore

Hello. My dataset is stored in Azure Blob Storage as Parquet files. I am able to create an Azure ML dataset using Tabular.from_parquet_files. My question is: how do I convert this to a Hugging Face dataset? I'd like something equivalent to:
from datasets import load_dataset
dataset = load_dataset("parquet", data_files={"train": "train.parquet", "test": "test.parquet"})
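A sketch of one possible route, assuming the data fits in memory (tabular_ds below stands for the TabularDataset created with Tabular.from_parquet_files):

from datasets import Dataset

# Assumption: tabular_ds is the Azure ML TabularDataset created above.
# to_pandas_dataframe() materializes everything in memory, so this only
# suits datasets that fit in RAM.
df = tabular_ds.to_pandas_dataframe()
hf_dataset = Dataset.from_pandas(df)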

Any general recommendation on how to create a PyTorch DataLoader for large Parquet files in Azure?
Thanks

Hi ! If your Parquet files are public, you can load them by passing their HTTP URLs to load_dataset. You can also stream the dataset using streaming=True (especially useful if your dataset is very big), and then pass the dataset directly to a PyTorch DataLoader (see the documentation).
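A minimal sketch of that flow, assuming your files are reachable at public HTTP(S) URLs (the URLs below are placeholders):

from datasets import load_dataset
from torch.utils.data import DataLoader

# Placeholder URLs: replace with the public URLs of your blobs
data_files = {
    "train": "https://<account>.blob.core.windows.net/<container>/train.parquet",
    "test": "https://<account>.blob.core.windows.net/<container>/test.parquet",
}

# streaming=True returns an IterableDataset that reads rows on the fly
# instead of downloading the full files first
streamed = load_dataset("parquet", data_files=data_files, streaming=True)

# Format the columns as torch tensors and iterate with a standard DataLoader
train_ds = streamed["train"].with_format("torch")
loader = DataLoader(train_ds, batch_size=32)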

We don’t support accessing private Azure blob storage yet, though there is an open issue about it: Support cloud storage in load_dataset · Issue #5281 · huggingface/datasets · GitHub
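In the meantime, one possible workaround (a sketch using the azure-storage-blob client; the connection string, container and blob names are placeholders you'd fill in) is to download the blobs to local files and load those:

from azure.storage.blob import BlobClient
from datasets import load_dataset

# Placeholders: substitute your own connection string, container and blob name
blob = BlobClient.from_connection_string(
    conn_str="<storage-connection-string>",
    container_name="<container>",
    blob_name="train.parquet",
)

# Download the Parquet blob to a local file
with open("train.parquet", "wb") as f:
    f.write(blob.download_blob().readall())

# Load the local file as a Hugging Face dataset
dataset = load_dataset("parquet", data_files={"train": "train.parquet"})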