Error when streaming FASTA files from the Hub?


Are FASTA files (.fna/.fasta) currently supported for streaming from the Hugging Face Hub? When I try to open a FASTA file (using file = open(filename, 'rb')), a FileNotFoundError is thrown.

Are there any work-arounds for streaming this data format in a large dataset? Currently I can only use the dataset with streaming=False. Could this have to do with the way I am opening the file?


Take a look at Create an audio dataset. Check your path. Use fsspec if the path is remote.

open inside a dataset script is monkey-patched to support remote URLs, so this is not the issue. Debugging such issues without access to the dataset script is hard, so please share it here if possible (dummy data instead of real data is OK).

The error occurred when trying to stream the HG38 reference genome (or any FASTA file) hosted on the HF Hub and opened with pyFaidx by passing in the relative path in the repo. I couldn't make this work out of the box, so as a workaround I just download the FASTA file to the cache even when streaming.
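For anyone landing here, a rough sketch of that kind of workaround outside a dataset script. It assumes the file lives in a dataset repo and uses hf_hub_download from huggingface_hub; the function name open_cached_fasta and its arguments are made up for illustration, and this is not necessarily how the script above did it.

```python
def open_cached_fasta(repo_id: str, filename: str):
    """Download a FASTA file from a Hub dataset repo to the local cache
    and open it with pyFaidx (hypothetical helper, not part of any library)."""
    # Imports are kept local so this module loads even if the packages are absent.
    from huggingface_hub import hf_hub_download
    from pyfaidx import Fasta

    # hf_hub_download returns the path of the cached local copy.
    local_path = hf_hub_download(
        repo_id=repo_id,
        filename=filename,
        repo_type="dataset",  # assumption: the FASTA sits in a dataset repo
    )
    # pyFaidx builds (or reuses) a .fai index next to the local file.
    return Fasta(local_path)
```

The trade-off is that the whole file is downloaded once, so it is not true streaming, but pyFaidx then gets a normal local path and random access works as usual.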

Based on the pyFaidx source, a FASTA file can be streamed by opening "path/to/file" with fsspec inside a dataset script (our dataset streaming logic is built on top of fsspec). Integrations usually support fsspec URLs directly, but this one does not, which makes it a bit harder to use.
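If indexed random access is not needed, a minimal sketch of sequential streaming is to parse records straight from any binary file-like object. The iter_fasta helper below is hypothetical (it is not part of pyFaidx or datasets) and skips pyFaidx's index entirely; an in-memory buffer stands in for the remote file here.

```python
import io
from typing import BinaryIO, Iterator, Tuple


def iter_fasta(handle: BinaryIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a binary FASTA stream, one record at a time."""
    header = None
    chunks = []
    for raw in handle:
        line = raw.decode("ascii").strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    # Emit the final record once the stream is exhausted.
    if header is not None:
        yield header, "".join(chunks)


# Local stand-in for a remote file; any object that yields bytes lines works.
demo = io.BytesIO(b">chr1 demo\nACGT\nTTGA\n>chr2\nGGCC\n")
records = list(iter_fasta(demo))
```

With fsspec installed, the same generator should work on a remote file by replacing the buffer with the handle from `with fsspec.open(url, "rb") as f:`, where url is whatever fsspec URL points at your file. Note that each record is held in memory as a whole, so a full chromosome is still one large string.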
