I'm an entry-level researcher trying to find datasets containing pre-2001 website snapshots. Lists of URLs would also work for me. I have been searching for a few days, and the results mostly point to early TREC datasets like WT2g, or to resources from national libraries or Alexa crawls that likely cannot be downloaded.
I'm a complete beginner at this work, so any suggestions or resources?
There are multiple dependable mirrors of the NCSA Mosaic "What's New" lists. Here is the context, where to get them, and how to use them as high-quality seeds for pre-2001 web reconstruction.
What these pages are
Curated announcement lists maintained at NCSA in 1993–1994 describing "recent changes and additions" on the Web, with submission instructions and the maintainer address, plus month-by-month archives linked from the top page. They are primary, contemporaneous records of new servers and notable resources coming online in mid-1993 through early 1994. (W3C)
Reliable mirrors you can cite and scrape
W3C mirror (landing + month links). Includes inline text noting the service purpose and the whats-new@ncsa.uiuc.edu contact, and links to month pages for June–December 1993 and January–March 1994. Use this when you need a canonical index of months. (W3C)
Aberystwyth Computer Science mirror (per-month files). Individual month pages such as December 1993 and January 1994 with dated entries and many original hostnames. Good for bulk extraction because files are in a simple, consistent HTML. (aber.ac.uk)
University of Utah (Math) mirror. Another long-standing mirror of the same "What's New" content; useful redundancy if one mirror has gaps or encoding issues. (The University of Utah)
NTUA SoftLab mirror. A copy bundled with Mosaic 2.4 documentation; helpful as a third cross-check for text variants. (softlab)
Background: These pages sit in the historical context of Mosaic's 1993 take-off at NCSA, which helps explain why the lists quickly became the de facto public index of "new on the Web" sites. (NCSA)
How to use them effectively
Scope and dating. Treat each page as a monthly log of announcements. The W3C index exposes month links; the per-month mirrors carry day-stamped entries. Use those stamps to assign a first-seen month for each URL. (W3C)
Extraction. Parse all <a href> targets from each month page. Keep the literal hostnames as seen. Do not auto-fix protocols or trailing slashes; store both raw and normalized forms. Aber month files are especially uniform for scraping. (aber.ac.uk)
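As a minimal sketch of that extraction step, Python's standard-library parser is enough; the sample markup below is illustrative (shaped like the month files), not a quote from the real pages:

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, urlunsplit

class HrefExtractor(HTMLParser):
    """Collect every <a href> target, keeping the raw string as seen."""
    def __init__(self):
        super().__init__()
        self.raw_urls = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.raw_urls.append(value)

def normalize(url):
    """Lowercase only scheme and host; leave path, query, fragment as-is.
    Store this alongside the raw form, never in place of it."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path, parts.query, parts.fragment))

# Illustrative fragment; real month pages carry many such entries.
sample = '<DT><A HREF="http://WWW.Example.EDU/docs/">Example server</A>'
p = HrefExtractor()
p.feed(sample)
for raw in p.raw_urls:
    print(raw, "->", normalize(raw))
```

Keeping both forms lets you deduplicate on the normalized URL while still citing the literal hostname that appeared in the 1993–1994 source.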
Redundancy and verification. Mirrors sometimes differ in encoding or minor text. Prefer W3C for the month index and meta text, and Aber for month content. Fall back to Utah or NTUA if a page 404s or has character-set problems. (W3C)
Convert to seeds. Deduplicate by host and by exact URL, then feed the list into capture enumeration for ≤2000 windows using the Wayback CDX API (from=19930601, to=20001231, filter=mimetype:text/html, collapse=digest). This turns 1993–1994 "What's New" links into pre-2001 mementos you can actually download. (W3C)
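A small helper to build that query against the public CDX endpoint, using exactly the parameters above (the seed URL in the usage line is just an example):

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(seed_url):
    """Build a CDX query for pre-2001 HTML captures of one seed URL,
    using the window and filters described above."""
    params = {
        "url": seed_url,
        "from": "19930601",
        "to": "20001231",
        "filter": "mimetype:text/html",
        "collapse": "digest",   # drop consecutive identical captures
        "output": "json",
    }
    return CDX_ENDPOINT + "?" + urlencode(params)

print(cdx_query_url("info.cern.ch/hypertext/WWW/TheProject.html"))
```

Fetching that URL returns one row per surviving capture in the window; iterate it over your deduplicated seed list, and be gentle with request rates.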
Provenance notes in papers. When you cite, point to the W3C landing page for definition and policy text, then the exact month page you used for the seed, and mention at least one independent mirror used for cross-check. (W3C)
Known limitations and fixes
Not every new site was listed. Inclusion was by manual submission and editorial choice. Treat it as a curated sample, not a census. Cross-validate coverage with the CDX index for the same months. (W3C)
Dead or moved links. Many anchors resolve only via archives. Use the CDX API against the extracted URLs to find earliest captures and replay them. (W3C)
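With output=json, the CDX response is a header row followed by one row per capture, sorted ascending by timestamp, so the earliest capture is simply the first data row. A sketch (the sample response below is illustrative, not a real API result):

```python
import json

# Illustrative CDX JSON response: field-name header row, then capture rows.
sample = json.loads(
    '[["urlkey","timestamp","original","mimetype","statuscode","digest","length"],'
    '["com,example)/","19970101000000","http://example.com/","text/html","200","ABC","1234"]]'
)

def earliest(rows):
    """First data row of an ascending-sorted CDX result = earliest capture."""
    header, data = rows[0], rows[1:]
    if not data:
        return None   # no surviving captures for this URL
    row = dict(zip(header, data[0]))
    return row["timestamp"], row["original"]

print(earliest(sample))
```

The (timestamp, original) pair is what you need to replay the memento in the next step.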
Encoding noise. Some mirrors have minor encoding glitches. If a page fails to parse cleanly, try another mirror; do not silently "fix" text. (The University of Utah)
Minimal workflow you can adopt
1) Start at the W3C "What's New" index to enumerate months.
2) For each month, scrape the Aber page to extract URLs and dates.
3) Cross-check any failures on the Utah or NTUA mirrors.
4) Feed the seed list to the CDX API and collect pre-2001 capture timestamps.
5) Fetch the original bytes with .../web/{timestamp}id_/... and write WARC if you need a portable corpus. (W3C)
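Step 5 can be sketched like this; the id_ suffix asks Wayback for the archived bytes without replay-time link rewriting, and the WARC side is left to a library (warcio is an assumption here: third-party, not stdlib):

```python
import urllib.request

def replay_url(timestamp, original_url):
    """The id_ flag requests the original archived bytes,
    without Wayback's replay-time link rewriting."""
    return f"https://web.archive.org/web/{timestamp}id_/{original_url}"

def fetch_capture(timestamp, original_url, timeout=30):
    # Network call: fetch one memento's raw bytes.
    req = urllib.request.Request(
        replay_url(timestamp, original_url),
        headers={"User-Agent": "pre2001-corpus-builder/0.1"},  # identify yourself
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()

# To build a portable corpus, wrap each fetched payload in a WARC record
# (e.g. with warcio's WARCWriter) rather than storing loose files.
print(replay_url("19961022175643", "http://www.example.com/"))
```

The timestamp in the usage line is just an example value; in practice you plug in the (timestamp, original) pairs enumerated from the CDX step.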