This seems to be a known failure mode:
Your failure mode is normal for long-span archives. The system is optimized for semantic topicality, but the user is asking for semantic topicality constrained by time. Recent temporal IR and temporal QA work treats those as different problems: systems need to detect temporal intent, normalize time expressions, and reason over evolving facts, because ordinary semantic retrieval does not reliably gather evidence that is both relevant and temporally coherent. (arXiv)
What is going wrong
In your pipeline, the timeline cue is extracted but then removed from candidate generation. That helps broad semantic matching, but on a 50-year corpus it also removes the only strong signal that distinguishes “XYZ political party in 2020” from “XYZ political party in 1993, 2008, or 2024.” The current diachronic RAG literature describes this directly: standard semantically driven retrieval has a blind spot on longitudinal queries because it fails to collect evidence that is both topically relevant and temporally coherent for the requested period. A recent paper reports that a time-aware retriever that separates subject from temporal window outperformed standard RAG by 13% to 27% on its benchmark. (arXiv)
How production systems usually handle “entity + time window” queries
Production systems usually do not rely on “semantic ANN first, then date cleanup” for explicit narrow windows. The common pattern is to treat time as a structured retrieval condition and push it into the search path itself. Weaviate documents this as pre-filtering with an allow-list passed into vector search. Milvus documents filtered search as restricting ANN search to entities matching the metadata condition before the vector stage runs. Pinecone’s metadata-filtering paper likewise emphasizes integrating filtering into the retrieval path for accuracy. (Weaviate Documentation)
The more mature pattern is not “always pre-filter” or “always shard.” It is query-adaptive execution. Recent filtered-ANN systems work formalizes three families: pre-filtering, post-filtering, and runtime or inline filtering, and argues that production behavior is shaped heavily by query planning and system architecture, not just by a single algorithmic choice. Vespa exposes this explicitly: it can switch between approximate nearest-neighbor search with filtering and exact nearest-neighbor search with pre-filtering depending on the estimated filter hit ratio. (arXiv)
So the production answer to your first question is:
- explicit, narrow window → hard filtering during candidate generation
- broad or fuzzy temporal intent → softer time features in ranking
- no temporal intent → standard retrieval
- strict window plus recall concerns → filtered branch plus a broader fallback branch (Weaviate Documentation)
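As a sketch, that routing logic might look like the following. The intent fields ("kind", "window", "recall_critical") are assumptions about what an upstream query-understanding step would emit, not part of any cited system:

```python
def plan_retrieval(intent):
    """Choose retrieval branches from detected temporal intent (sketch).

    `intent` is a dict produced upstream, e.g.
    {"kind": "explicit", "window": (2019, 2021), "recall_critical": True}.
    """
    if intent.get("kind") == "explicit" and intent.get("window"):
        branches = ["time_filtered_dense"]       # hard filter in candidate generation
        if intent.get("recall_critical"):
            branches.append("global_fallback")   # broader branch, merged later
        return branches
    if intent.get("kind") == "fuzzy":
        return ["dense_with_time_ranking"]       # soft time features in ranking only
    return ["dense"]                             # no temporal intent: standard retrieval
```

The point of keeping the output a list of branch names is that the strict-window case can fan out into two branches that are merged before reranking.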
Pre-filtering vs post-filtering
For your second question, pre-filtering or inline/runtime filtering is generally preferred over post-filtering for explicit, selective time windows. This is not just theory. Weaviate states two concrete disadvantages of post-filtering: you cannot predict how many valid results will remain after filtering, and if the filter is restrictive, the initial vector search may contain no valid match at all. A FAISS issue from 2020 describes exactly your problem in simpler form: if FAISS retrieves top hits first and you filter by date later, none of the requested top hits may fall in the desired date range. The VLDB tutorial on filtered vector search says the same thing in database terms: post-filtering often requires retrieving a multiple of K just to leave enough results after filtering, which complicates recall and tuning. (Weaviate Documentation)
That said, the best production answer is not “pre-filtering everywhere.” When filters are not selective, or when the estimate is uncertain, adaptive or runtime strategies can win. Qdrant says its planner chooses the strategy based on filter selectivity, indexes, and segment characteristics. Vespa documents thresholds that decide whether to use approximate neighbors with filtering, exact search with pre-filtering, or post-filtering heuristics. So the more precise answer is: pre-filter for narrow time windows, adaptive planning for mixed workloads. (Qdrant)
What I would do in your exact stack
I would not start with year-wise shards. I would first change candidate generation inside your existing FAISS setup. FAISS now supports subset search through SearchParameters and IDSelector. The wiki says you can select a subset of vectors by id and pass that selector through the sel field of SearchParameters. It also gives a very relevant optimization: for IndexIVF with sorted ids, IDSelectorRange can search only the relevant subsection of inverted lists, and the time-complexity note says this can be effectively O(k) rather than O(n) when ids are sorted appropriately. (GitHub)
That gives you a practical first move:
- keep one main IVFPQ index,
- map normalized time windows to allowed ids,
- use SearchParametersIVF / IVFPQSearchParameters with sel,
- run filtered dense retrieval for explicit time queries,
- in parallel, run a smaller global fallback search,
- union, exact-rerank from memmap, then cross-encoder rerank. (GitHub)
This is lower risk than sharding and is closer to how production systems handle selective filters without multiplying operational overhead. It also directly addresses the specific failure you observed: the candidate pool becomes time-aware before reranking, instead of asking reranking to rescue a temporally mixed candidate set. (Weaviate Documentation)
Why I would add a fallback branch
Hard filtering by document timestamp alone can hurt recall when there is timestamp noise or when relevant evidence is retrospective. A 2024 article summarizing the 2020 election may still be useful for “what happened in 2020,” depending on your answering policy. That is why I would use two branches for explicit temporal queries: a time-filtered dense branch for precision, and a smaller global branch for recall protection. Vespa’s documentation is useful here because it explicitly warns that restrictive filters can return low-quality neighbors and suggests analyzing distance thresholds under filtered search. In other words, once filters get tight, you need a way to prevent the system from returning semantically weak but in-range material just because it passes the filter. (docs.vespa.ai)
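One cheap way to implement that protection is a quality floor on the filtered branch, compared against the global branch. This is a sketch in the spirit of Vespa's distance-threshold advice, not its mechanism; the (doc_id, similarity) hit shape and the `margin` value are assumptions:

```python
def guard_filtered_hits(filtered_hits, global_hits, margin=0.15):
    """Drop in-window hits whose similarity is far below the best global hit,
    so a tight time filter cannot force semantically weak material into the
    answer. Hits are (doc_id, similarity) pairs, higher is better; `margin`
    is a tuning knob, not a value from any cited system."""
    if not global_hits:
        return filtered_hits
    floor = max(score for _, score in global_hits) - margin
    return [(doc_id, s) for doc_id, s in filtered_hits if s >= floor]
```

With margin=0.15, an in-range hit at similarity 0.5 is dropped when the global branch found something at 0.92, signaling that the window simply has no strong evidence.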
Would I shard by year or period
Only if temporal queries are a large share of traffic and the windows are usually narrow. Partitioning by year or period is a real pattern, but it is most attractive when the filter dimension is dominant and stable. The filtered-vector-search literature notes that partitioned indexes are a reasonable option when a categorical filter is known in advance, but it also emphasizes that the optimal method depends on selectivity, correlation, k, and data distribution. For your current stack, subset search inside one FAISS index is the better first move because it fixes the retrieval logic without introducing multi-index routing and merge complexity.
A good compromise is coarse partitioning, not year-sharding everywhere. For example, use decade partitions or a few broad eras for routing, plus filtered retrieval inside the chosen partition, plus a small global fallback. That reduces search scope without forcing you to operate dozens of indexes. This is an inference from the system behavior described in the filtered-ANN literature and from how production systems adapt execution based on filter selectivity rather than using one static method. (arXiv)
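Routing for coarse partitions is simple enough to sketch directly; the decade boundaries below are illustrative:

```python
def route_partitions(window, partitions):
    """Pick the partitions whose coverage overlaps the requested window.

    `partitions` maps a partition name to its (start_year, end_year) coverage;
    `window` is the normalized (start_year, end_year) from the query.
    """
    lo, hi = window
    return [name for name, (p_lo, p_hi) in partitions.items()
            if p_lo <= hi and lo <= p_hi]

decades = {"1980s": (1980, 1989), "1990s": (1990, 1999),
           "2000s": (2000, 2009), "2010s": (2010, 2019),
           "2020s": (2020, 2029)}
```

A window like 2019-2021 then touches two partitions, which is the main operational cost of partitioning: cross-boundary queries need multi-index search and merge.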
What I would change beyond first-stage retrieval
I would keep your cross-encoder reranker, but I would stop treating time as only metadata fetched after retrieval. Add explicit temporal features into final ranking: inside-window bonus, distance-to-window penalty, exact year match, exact event alias match, and timestamp confidence. Temporal IR work emphasizes that the problem is not only retrieving by time but also ordering events and reasoning over evolving facts, so time should affect ranking as well as eligibility. (arXiv)
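A minimal version of blending those temporal features into the final score might look like this; the feature set is the one listed above, but the weights and the linear combination are assumptions to be tuned, not values from the cited work:

```python
def temporal_score(base_score, doc_year, window, weights=None):
    """Combine cross-encoder relevance with explicit temporal features.

    `base_score` is the reranker score; `window` is (start_year, end_year).
    Weights are illustrative defaults, not tuned values.
    """
    w = weights or {"in_window": 0.2, "distance": 0.05}
    lo, hi = window
    if lo <= doc_year <= hi:
        return base_score + w["in_window"]            # inside-window bonus
    dist = min(abs(doc_year - lo), abs(doc_year - hi))
    return base_score - w["distance"] * dist          # distance-to-window penalty
```

Exact-year match, event-alias match, and timestamp confidence would slot in as further additive terms once you have those signals per chunk.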
I would also distinguish publication time from event time when possible. In news and archive corpora, those are often different. A document published later may still describe events inside the target window. This is exactly the kind of ambiguity that temporal QA work identifies as a major challenge. If you can store both fields, filtered retrieval can use publication time, event time, or both depending on query intent. (arXiv)
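If you store both fields, the eligibility check becomes a small policy decision per query. A sketch, where the field names and the "either" default are assumptions:

```python
from dataclasses import dataclass

@dataclass
class DocTime:
    """Both timestamps per document: when it was published, and which years
    its text describes (the latter extracted upstream, e.g. from dateline
    or time expressions)."""
    publication_year: int
    event_years: tuple

def matches_window(doc, window, mode="either"):
    """Decide in-window eligibility by publication time, event time, or either."""
    lo, hi = window
    pub_in = lo <= doc.publication_year <= hi
    event_in = any(lo <= y <= hi for y in doc.event_years)
    if mode == "publication":
        return pub_in
    if mode == "event":
        return event_in
    return pub_in or event_in
```

Under this policy, the 2024 retrospective on the 2020 election passes in "event" or "either" mode but is excluded in "publication" mode.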
How I would evaluate it
Do not judge this fix only by overall Recall@5. Add temporal metrics:
- in-window Recall@K
- in-window nDCG
- rank of first relevant in-window chunk
- answer accuracy on explicit temporal queries
- answer accuracy on implicit temporal queries
- latency by query class
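The first of those metrics is easy to pin down concretely. A sketch of in-window Recall@K, where queries with no in-window gold are excluded from the average rather than scored as zero (a judgment call, flagged as such):

```python
def in_window_recall_at_k(ranked_ids, relevant_in_window, k):
    """Of the gold chunks that fall inside the requested time window,
    what fraction appears in the top K retrieved ids?

    Returns None when the query has no in-window gold, so callers can
    exclude it from the macro-average instead of counting it as 0.
    """
    if not relevant_in_window:
        return None
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_in_window)
    return hits / len(relevant_in_window)
```

In-window nDCG follows the same pattern with graded gains; the key shared ingredient is a per-query set of in-window relevant chunk ids, which your gold annotations need to carry.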
For public evaluation references, ChronoQA is directly relevant: it is a temporal-sensitive RAG benchmark built from over 300,000 news articles and 5,176 questions with explicit and implicit time expressions. LongMemEval is also useful for temporal reasoning and knowledge updates, though it is broader than archive retrieval. (arXiv)
Direct answers
1. How do production systems typically handle “entity + time window” queries at scale without killing recall or latency?
By making time part of retrieval, not just part of post-processing. The common production pattern is filtered or inline ANN for narrow windows, often combined with adaptive planning and a broader fallback branch. Systems such as Weaviate, Milvus, Qdrant, Pinecone, and Vespa all document integrated filtering strategies rather than pure post-filter cleanup. (Weaviate Documentation)
2. Is pre-filtering generally preferred over post-filtering for time-constrained queries?
Yes, for explicit and selective windows. Post-filtering is usually weaker because it can discard too many candidates and leave you with unstable recall. But the strongest production answer is not static pre-filtering everywhere. It is adaptive filtered retrieval: pre-filter or inline filter when the time window is selective, and fall back to broader search or exact search when the planner estimates that strict filtering will hurt recall or result quality. (Weaviate Documentation)
My concrete recommendation is this: keep FAISS, add time-aware subset search with IDSelector, preserve a small global fallback branch, and only consider sharding after you measure how often temporal selectivity is high enough to justify operational complexity. That is the least disruptive path that aligns with both current production practice and recent temporal-RAG research. (GitHub)