Read data from HDFS

Hi, I’m new to Hugging Face. From what I’ve seen so far, the code I’ve read downloads the data to local disk before any further steps.
Is it possible to stream data from a remote source like HDFS instead? My training data may be huge, so downloading it would exhaust my disk storage.
I’d be thankful for any hints.

hey @gfork, I’ve never tried it myself, but the datasets library lets you process data with Apache Beam: Beam Datasets (datasets 1.5.0 documentation)

perhaps that is suitable for your use case?
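
For reference, the general idea of streaming rather than downloading can be sketched in plain Python. This is only a rough illustration and not the Beam approach from the docs. It assumes the data is stored as JSON Lines (one JSON object per line), and `stream_records` and its `opener` parameter are hypothetical names made up for this example. On HDFS you could probably pass something like `fsspec.open` as the opener, but I haven’t tested that.

```python
import json
from typing import Callable, Iterator


def stream_records(path: str, opener: Callable = open) -> Iterator[dict]:
    """Yield one parsed JSON record per line, reading the file lazily.

    `opener` is any callable that returns a text-mode context manager.
    The default is the builtin `open` for local files. For HDFS, an
    untested assumption is something like:
        lambda p: fsspec.open(p, "rt")
    """
    with opener(path) as handle:
        # Iterating the handle reads one line at a time, so the whole
        # file is never held in memory or copied to local disk.
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

You would then loop over `stream_records(...)` and feed the records to your preprocessing one at a time, so only the current line needs to be in memory.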

Thanks @lewtun, I’ll try that, and give you feedback if it works. :smile:
