Azure ML Datastore

Hello. My dataset is in Azure Blob Storage as Parquet files. I am able to create an Azure ML dataset using Tabular.from_parquet_files. My question is: how do I convert this to a Hugging Face dataset?
from datasets import load_dataset
dataset = load_dataset("parquet", data_files={"train": "train.parquet", "test": "test.parquet"})

Any general recommendations on how to create a PyTorch DataLoader for large Parquet files in Azure?

Hi! If your Parquet files are public, you can load them by passing their HTTP URLs to load_dataset. You can also stream the dataset with streaming=True (especially useful if your dataset is very big),
and then pass the dataset directly to a PyTorch DataLoader (see the documentation).

We don't support accessing private Azure Blob Storage yet, though there is an open issue about it: Support cloud storage in load_dataset · Issue #5281 · huggingface/datasets · GitHub