How to use DownloadManager on Git LFS Files

I am trying to add a Hugging Face dataset that combines multiple data sources.

I have dataset files in ./data/ that look like the following:

  • ./data/datasetA/
    • ./data/datasetA/datasetA_train.json
    • ./data/datasetA/datasetA_test.json
    • ./data/datasetA/datasetA.tar.gz
  • ./data/dataset_B/
    • ./data/dataset_B/datasetB_train.json
    • ./data/dataset_B/datasetB_test.json
    • ./data/dataset_B/datasetB.tar.gz

All the files are stored using git-lfs. If I first run git lfs pull --include ./data/*/*.json and git lfs pull --include ./data/*/*.tar.gz, then DownloadManager.download('data/datasetA/datasetA_train.json') works.

What if I have not used git-lfs to pull them locally though? Can I use the DownloadManager to load each of these files?

The issue was that I was passing only the file's path inside the repo, without prepending the repo URL. The call needs the full URL, i.e.

DownloadManager.download('https://huggingface.co/datasets/MY_DATASET/resolve/main/data/datasetA/datasetA_train.json')

The DownloadManager needs the git lfs Download File link (accessible if you inspect the LFS file in your repo).
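In case it helps anyone else, here is a minimal sketch of how the URLs could be built instead of being hard-coded. The helper names build_resolve_url and download_train_files are my own (hypothetical, not part of the datasets library), MY_DATASET is the same placeholder as above, and I am assuming the files live on the main branch. The only library call is dl_manager.download, which accepts a single URL, a list, or a dict of URLs.

```python
from urllib.parse import quote

# Assumption: the dataset repo is hosted on the Hugging Face Hub.
HF_DATASETS_BASE = "https://huggingface.co/datasets"


def build_resolve_url(repo_id, path_in_repo, revision="main"):
    """Hypothetical helper: turn a path inside the repo into a full 'resolve' URL."""
    path = path_in_repo
    # Drop a leading "./" so "./data/x.json" and "data/x.json" give the same URL.
    if path.startswith("./"):
        path = path[2:]
    # quote() escapes unsafe characters but leaves "/" untouched.
    return f"{HF_DATASETS_BASE}/{repo_id}/resolve/{revision}/{quote(path)}"


def download_train_files(dl_manager, repo_id="MY_DATASET"):
    """Hypothetical helper meant to be called from _split_generators(dl_manager)."""
    urls = {
        "datasetA_train": build_resolve_url(repo_id, "data/datasetA/datasetA_train.json"),
        "datasetB_train": build_resolve_url(repo_id, "data/dataset_B/datasetB_train.json"),
    }
    # download() returns the same structure with local cached paths as values.
    return dl_manager.download(urls)
```

The same pattern should work for the test files and the .tar.gz archives; only the path passed to build_resolve_url changes.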
