Are FASTA files (.fna/.fasta) currently supported for streaming from the Hugging Face Hub? When trying to open a FASTA file (using
file = open(filename, 'rb')), a FileNotFoundError is thrown.
Are there any workarounds for streaming this data format in a large dataset? Currently I can only use the dataset with
streaming=False. Could this have to do with the way I am opening the file?
Take a look at Create an audio dataset. Check your path. Use fsspec if the path is remote.
open inside a dataset script is monkey-patched to support remote URLs, so this is not the issue. Debugging such issues without access to a dataset script is hard, so please share it here if possible (dummy data instead of real data is OK)
The error occurred when trying to stream the HG38 reference genome (or any FASTA file) hosted on the HF Hub and opened with pyFaidx by passing in its relative path in the repo. I couldn’t make this work out of the box, so as a workaround I just downloaded the FASTA file to the cache, even when streaming.
Based on the
pyFaidx source, a FASTA file can be streamed by opening it inside a dataset script with
fsspec.open("path/to/file") (our dataset streaming logic is built on top of
fsspec). Integrations usually support
fsspec URLs, but this one does not, which makes it a bit harder to use.