Auth suddenly required for model access .... sometimes

I have an app that uses LlamaIndex, QdrantVectorStore, fastembed, etc. When we initialize, it goes out to HF to download the sparse model. This has worked in the past. Now we are getting an error ‘Invalid credentials in Authorization header’ that we never had before. This prompted us to download the models into a local cache and have the code pick them up from there. Running this snippet downloads the models without needing auth:

python3 -c "from fastembed import SparseTextEmbedding; import os; model_name = os.getenv('DEFAULT_SPARSE_EMBEDDING_MODEL_NAME'); cache_dir = os.getenv('FASTEMBED_CACHE_PATH'); _ = SparseTextEmbedding(model_name=model_name, cache_dir=cache_dir)"

But when I try to use this, it still wants to go out to HF to get the SHA via model_info(hf_source_repo).sha, and that fails with the same error.

First question I have is why does running that code to download the models not require auth?
Second question is why do I suddenly need auth and is there a way to use the locally cached models without auth?


It appears there have been changes to the Hub’s behavior between last year and this year. If you don’t want to modify existing code, I think the simplest approach is to log in.


Direct answers:

  1. Your one-liner does not require auth because it downloads public model files through code paths that allow anonymous access and use the local cache if present. No token gets attached, so the request succeeds. (Hugging Face)

  2. You suddenly “need auth” because a different code path is calling the Hub metadata API (e.g., HfApi().model_info(...).sha). If any Hugging Face token is present, even an invalid one, huggingface_hub automatically attaches it to the request, and the server returns 401 “Invalid credentials in Authorization header.” Reports of this picked up in 2025, usually after libraries added metadata lookups before loading. You can avoid it by removing the bad token, forcing offline/local-only mode, or passing token=False so the request is anonymous (token=None only stays anonymous when no token is stored). A quick diagnostic sketch follows below. (Hugging Face)
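
To tell which situation you are in, it helps to see what token huggingface_hub would actually attach. A minimal diagnostic sketch, assuming a recent huggingface_hub where get_token and whoami are importable from the top-level package:

from huggingface_hub import get_token, whoami
from huggingface_hub.utils import HfHubHTTPError

token = get_token()  # HF_TOKEN / HUGGING_FACE_HUB_TOKEN env var or the saved login, if any
if token is None:
    print("No token found; Hub requests will be anonymous.")
else:
    try:
        print("Token belongs to:", whoami(token=token)["name"])
    except HfHubHTTPError as err:
        print("A token is present but the Hub rejects it:", err)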

Background and context

How FastEmbed and the Hub interact

  • File downloads: hf_hub_download / snapshot_download fetch files and cache them. For public repos they work without any token. Cached files are then reused. Your one-liner uses this behavior. (Hugging Face)
  • Metadata calls: HfApi().model_info(repo_id) queries the Hub REST API for repo metadata and returns a commit SHA used for snapshotting. If your environment or keyring has a token, the library auto-attaches it, and a stale or malformed token triggers a 401 even for public repos. The sketch after this list contrasts the two code paths. (Hugging Face)
  • FastEmbed sparse models like prithivida/Splade_PP_en_v1 are public and shown in Qdrant’s docs and examples, so anonymous file fetches and cached use work. (qdrant.tech)
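
Here is the contrast in one place. A minimal sketch, assuming the public repo prithivida/Splade_PP_en_v1 exposes a config.json at its root:

from huggingface_hub import HfApi, hf_hub_download

repo = "prithivida/Splade_PP_en_v1"

# File-download path: anonymous access works for public repos; token=False
# guarantees nothing from the environment or keyring is attached.
path = hf_hub_download(repo_id=repo, filename="config.json", token=False)
print("cached at:", path)

# Metadata path: this is the call that auto-attaches any token it finds, so a
# stale token turns it into a 401 even though the repo is public.
print("resolved commit:", HfApi().model_info(repo, token=False).sha)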

Why it worked before and fails now

  • Environment drift: a token landed in HF_TOKEN / HUGGING_FACE_HUB_TOKEN or was saved by huggingface-cli login, turning formerly anonymous requests into authenticated ones. Env-var tokens override stored tokens. Date: docs updated through 2024–2025. (Hugging Face)
  • Library changes: newer releases or upstream code paths started hitting model_info(...) to resolve revisions or SHAs before load. Users began reporting 401s in Mar–Jul 2025 when an invalid token was present. (GitHub)

Make cached models work without auth

Pick one. All are valid.

A) Remove or neutralize the token

  • Unset env vars and logout so calls are anonymous:

    # remove token influence
    unset HF_TOKEN HUGGING_FACE_HUB_TOKEN
    huggingface-cli logout   # optional: clears saved token
    

    Then HfApi().model_info(repo, token=False) keeps requests unauthenticated even if a stray token is still present; with the token removed, the default token=None is anonymous as well. (Hugging Face)
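
    If the app cannot control its shell environment, the same neutralizing can be done in-process. A minimal sketch (the two variable names are the documented ones; a token saved by huggingface-cli login is not affected, so still run the logout above):

    import os

    # Drop token env vars before any Hub call is made so requests stay anonymous.
    for var in ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN"):
        os.environ.pop(var, None)

    from fastembed import SparseTextEmbedding  # import downstream libs after the cleanup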

B) Force full offline

  • Set offline mode before imports:

    export HF_HUB_OFFLINE=1
    

    This makes file loaders use the cache and makes any HfApi call raise OfflineModeIsEnabled; avoid or guard metadata calls in that mode. (Hugging Face)
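
    If some code still calls the metadata API while offline, a guarded fallback keeps it from raising. A sketch, assuming OfflineModeIsEnabled is importable from huggingface_hub.utils and that you pin a revision of your own (the value below is a placeholder):

    from huggingface_hub import HfApi
    from huggingface_hub.utils import OfflineModeIsEnabled

    PINNED_REVISION = "main"  # placeholder; pin a real commit in your config

    try:
        revision = HfApi().model_info("prithivida/Splade_PP_en_v1", token=False).sha
    except OfflineModeIsEnabled:
        revision = PINNED_REVISION  # offline: fall back instead of failing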

C) Local-only loading via library flags

  • Where supported, pass local_files_only=True in the FastEmbed wrapper so the underlying Hub downloads never run. Support for this in the sparse models lagged in 2024, then matured; by 2025 maintainers confirmed the flag is propagated. (GitHub)

D) Bypass the Hub API entirely

  • If your stack queries model_info(...).sha, skip it by pinning a known commit for the model in your config, or load from an absolute local path when your FastEmbed version supports it. Snapshotting by a pinned commit avoids revision resolution entirely. (Hugging Face)
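
    A minimal sketch of that, assuming the model is already in the cache; the commit hash below is a placeholder, not a real revision:

    from huggingface_hub import snapshot_download

    PINNED_COMMIT = "0123456789abcdef0123456789abcdef01234567"  # placeholder

    local_dir = snapshot_download(
        repo_id="prithivida/Splade_PP_en_v1",
        revision=PINNED_COMMIT,
        local_files_only=True,  # serve from the cache, never touch the network
        token=False,
    )
    print(local_dir)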

Concrete patterns

Minimal environment for local-only load

# shell: cache is prewarmed already
export FASTEMBED_CACHE_PATH=/models/fastembed_cache
export DEFAULT_SPARSE_EMBEDDING_MODEL_NAME=prithivida/Splade_PP_en_v1
export HF_HUB_OFFLINE=1
unset HF_TOKEN HUGGING_FACE_HUB_TOKEN
# docs: HF offline + caching; SPLADE usage
# https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables
# https://qdrant.tech/documentation/fastembed/fastembed-splade/

# python: load from the prewarmed cache
from fastembed import SparseTextEmbedding
import os

model = SparseTextEmbedding(
    model_name=os.environ["DEFAULT_SPARSE_EMBEDDING_MODEL_NAME"],
    cache_dir=os.environ["FASTEMBED_CACHE_PATH"],
    # local_files_only=True  # if your fastembed version supports it
)

(Hugging Face)

Keep online, but anonymous

from huggingface_hub import HfApi
sha = HfApi().model_info("prithivida/Splade_PP_en_v1", token=False).sha

This prevents attaching a bad token. (Hugging Face)

LlamaIndex + Qdrant hybrid with local SPLADE
LlamaIndex’s Qdrant hybrid example runs SPLADE locally via FastEmbed; combine it with the env fixes above. (LlamaIndex)
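
A minimal sketch of that wiring, assuming the llama-index-vector-stores-qdrant package and an in-memory Qdrant client; the collection name is arbitrary:

import qdrant_client
from llama_index.vector_stores.qdrant import QdrantVectorStore

client = qdrant_client.QdrantClient(location=":memory:")

# enable_hybrid adds a FastEmbed-backed sparse (SPLADE) encoder alongside the
# dense one; the env settings above keep that sparse model local and offline.
vector_store = QdrantVectorStore(
    collection_name="hybrid_demo",
    client=client,
    enable_hybrid=True,
)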

Operational checklist

  • Verify the cache is populated under your configured cache dir; the Hub stores refs, snapshots (by commit), and blobs (a quick listing sketch follows this list). Date: docs current. (Hugging Face)
  • If behind a firewall or air-gapped: expect hangs unless you set HF_HUB_OFFLINE=1 or local_files_only=True; this is a known issue pattern. Date: Apr 30, 2024 report; later fixes propagate the flag. (GitHub)
  • If you must authenticate: ensure a valid token and scopes; many 401 reports in 2025 were token mistakes. (Hugging Face Forums)
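
A quick way to verify the first point; a sketch assuming FASTEMBED_CACHE_PATH is the same variable used above (the default path is only a guess):

import os
from pathlib import Path

# Walk the prewarmed cache and print file sizes so you can confirm the model
# snapshot is really there before enabling offline mode.
cache = Path(os.environ.get("FASTEMBED_CACHE_PATH", "~/.cache/fastembed")).expanduser()
for f in sorted(p for p in cache.rglob("*") if p.is_file()):
    print(f"{f.stat().st_size:>12}  {f.relative_to(cache)}")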

Short, high-signal resources

Issues to watch

  • FastEmbed 401s to Hub, reports beginning Mar 14, 2025. (GitHub)
  • Firewall/offline hangs without local_files_only (Apr 30, 2024). (GitHub)
  • Sparse local_files_only gap (Sep 26, 2024), later propagation (Aug 6, 2025). (GitHub)

Docs

  • HF download + cache layout and snapshot behavior. Updated 2024–2025. (Hugging Face)
  • HF env vars and offline mode. Updated 2024–2025. (Hugging Face)
  • LlamaIndex Qdrant hybrid SPLADE example. Updated 2025. (LlamaIndex)
  • Qdrant FastEmbed SPLADE guide. Updated 2024–2025. (qdrant.tech)

Wow - thanks for the very detailed response.
