Struggling to find Web snapshot datasets before 2001

Huytic · October 9, 2025, 7:29pm

Hello.

I’m an entry-level researcher trying to find datasets that contain pre-2001 website snapshots. Lists of URLs would also work for me. I have been searching for a few days, and the results mostly point to early TREC datasets like WT2g, or to resources from national libraries or Alexa crawls, which do not appear to be downloadable.

I’m a complete beginner at this work, so any suggestions or resources would be appreciated.

Thanks.

John6666 · October 10, 2025, 12:19am

It’s a difficult challenge because so much data has physically vanished…

Huytic · October 12, 2025, 3:48pm

Thank you soooo much for your answer. Yeah it is a tough job.

Btw, there are various sources for the NCSA Mosaic “What’s New” pages (1993–1994), such as https://www.aber.ac.uk/~dcswww/Public/Misc/mosaic-docs/old-whats-new/

John6666 · October 12, 2025, 9:44pm

There seem to be many mirror sites.

There are multiple dependable mirrors of the NCSA Mosaic “What’s New” lists. Here is the context, where to get them, and how to use them as high-quality seeds for pre-2001 web reconstruction.

What these pages are

Curated announcement lists maintained at NCSA in 1993–1994 describing “recent changes and additions” on the Web, with submission instructions and the maintainer address, plus month-by-month archives linked from the top page. They are primary, contemporaneous records of new servers and notable resources coming online in mid-1993 through early 1994. (W3C)

Reliable mirrors you can cite and scrape

W3C mirror (landing + month links). Includes inline text noting the service purpose and the whats-new@ncsa.uiuc.edu contact, and links to month pages for June–December 1993 and January–March 1994. Use this when you need a canonical index of months. (W3C)
Aberystwyth Computer Science mirror (per-month files). Individual month pages such as December 1993 and January 1994 with dated entries and many original hostnames. Good for bulk extraction because the files use simple, consistent HTML. (aber.ac.uk)
University of Utah (Math) mirror. Another long-standing mirror of the same “What’s New” content; useful redundancy if one mirror has gaps or encoding issues. (The University of Utah)
NTUA SoftLab mirror. A copy bundled with Mosaic 2.4 documentation; helpful as a third cross-check for text variants. (softlab)

Background: These pages sit in the historical context of Mosaic’s 1993 take-off at NCSA, which helps explain why the lists quickly became the de facto public index of what was new on the Web. (NCSA)

How to use them effectively

Scope and dating. Treat each page as a monthly log of announcements. The W3C index exposes month links; the per-month mirrors carry day-stamped entries. Use those stamps to assign a first-seen month for each URL. (W3C)
Extraction. Parse all <a href> targets from each month page. Keep the literal hostnames as seen. Do not auto-fix protocols or trailing slashes; store both raw and normalized forms. Aber month files are especially uniform for scraping. (aber.ac.uk)
Redundancy and verification. Mirrors sometimes differ in encoding or minor text. Prefer W3C for the month index and meta text, and Aber for month content. Fall back to Utah or NTUA if a page 404s or has character-set problems. (W3C)
Convert to seeds. Deduplicate by host and by exact URL, then enumerate captures for each seed within the pre-2001 window using the Wayback CDX API (from=19930601, to=20001231, filter=mimetype:text/html, collapse=digest). This turns 1993–1994 “What’s New” links into pre-2001 mementos you can actually download. (W3C)
Provenance notes in papers. When you cite, point to the W3C landing page for definition and policy text, then the exact month page you used for the seed, and mention at least one independent mirror used for cross-check. (W3C)
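
The extraction and deduplication steps above can be sketched roughly as follows. This is only a minimal illustration using the Python standard library, not a tested scraper for any particular mirror. The names `AnchorCollector`, `normalize`, and `extract_seeds` are hypothetical, the normalization rule (lowercase scheme and host, keep the path) is one assumption among several reasonable choices, and the `first_seen_month` label is something you would supply yourself from the month page you are parsing.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, urlunsplit


class AnchorCollector(HTMLParser):
    """Collect the literal href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> works too.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def normalize(raw_url):
    """Comparison key only: lowercase scheme and host, default empty path to '/'."""
    parts = urlsplit(raw_url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))


def extract_seeds(html_text, first_seen_month):
    """Return one record per distinct normalized URL, keeping the raw form as seen."""
    parser = AnchorCollector()
    parser.feed(html_text)
    seen = {}
    for raw in parser.hrefs:
        parts = urlsplit(raw.strip())
        host = parts.hostname
        # Assumption: keep only absolute http/ftp/gopher links; skip relative and mailto.
        if parts.scheme.lower() not in ("http", "ftp", "gopher") or not host:
            continue
        key = normalize(raw)
        # setdefault keeps the first raw spelling encountered for each key.
        seen.setdefault(key, {"raw": raw, "normalized": key,
                              "host": host, "first_seen": first_seen_month})
    return list(seen.values())
```

Calling `extract_seeds(page_html, "1993-12")` on a downloaded month page gives a list of dictionaries that you can group by `host` or write out as a seed list. The raw form is stored untouched so that any later normalization decision stays reversible.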

Known limitations and fixes

Not every new site was listed. Inclusion was by manual submission and editorial choice. Treat it as a curated sample, not a census. Cross-validate coverage with the CDX index for the same months. (W3C)
Dead or moved links. Many anchors resolve only via archives. Use the CDX API against the extracted URLs to find earliest captures and replay them. (W3C)
Encoding noise. Some mirrors have minor encoding glitches. If a page fails to parse cleanly, try another mirror; do not silently “fix” text. (The University of Utah)
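
For the dead-link case, a CDX lookup could look something like the sketch below. It only builds the query URL and parses a JSON response, so the fetching itself is left to you. The parameters are the ones named earlier (from, to, filter, collapse), plus `output=json`. The function names `build_cdx_query` and `earliest_capture` are hypothetical, and the code assumes the JSON output carries a header row followed by data rows that include a `timestamp` field.

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(seed_url, start="19930601", end="20001231"):
    """Build a CDX query URL for HTML captures of seed_url in the given window."""
    params = [
        ("url", seed_url),
        ("from", start),
        ("to", end),
        ("filter", "mimetype:text/html"),
        ("collapse", "digest"),   # drop consecutive captures with identical content
        ("output", "json"),
    ]
    return CDX_ENDPOINT + "?" + urlencode(params)


def earliest_capture(cdx_json_text):
    """Return the earliest capture as a dict, or None if there are no rows.

    Assumes the first JSON row is a header naming the fields.
    """
    rows = json.loads(cdx_json_text) if cdx_json_text.strip() else []
    if len(rows) < 2:
        return None
    header, data = rows[0], rows[1:]
    records = [dict(zip(header, row)) for row in data]
    # 14-digit timestamps sort correctly as strings.
    return min(records, key=lambda r: r["timestamp"])
```

You would fetch the URL returned by `build_cdx_query(...)` with whatever HTTP client you prefer, pass the response body to `earliest_capture(...)`, and rate-limit your requests when looping over many seeds.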

Minimal workflow you can adopt

1) Start at the W3C “What’s New” index to enumerate months. 2) For each month, scrape the Aber page to extract URLs and dates. 3) Cross-check any failures on Utah or NTUA mirrors. 4) Feed the seed list to the CDX API and collect pre-2001 capture timestamps. 5) Fetch original bytes with .../web/{timestamp}id_/... and write WARC if you need a portable corpus. (W3C)
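
As a small sketch of the last step, the helpers below filter timestamps to the pre-2001 window and build the `id_` replay URL for one capture. The function names are hypothetical, the host `web.archive.org` is assumed as the replay endpoint, and WARC writing is deliberately left out because it needs a separate library.

```python
def is_pre_2001(timestamp):
    """True if a 14-digit Wayback timestamp (YYYYMMDDhhmmss) falls before 2001."""
    return timestamp < "20010101000000"


def raw_replay_url(timestamp, original_url):
    """Build the replay URL that returns the archived bytes without rewriting.

    The 'id_' suffix after the timestamp asks for the original payload
    rather than the page with the archive's banner and rewritten links.
    """
    if not (timestamp.isdigit() and len(timestamp) == 14):
        raise ValueError("expected a 14-digit timestamp, got %r" % timestamp)
    return f"https://web.archive.org/web/{timestamp}id_/{original_url}"
```

For example, a capture record with timestamp `19961219004855` for the placeholder URL `http://www.example.edu/` would map to `https://web.archive.org/web/19961219004855id_/http://www.example.edu/`, which you can then download and store alongside the timestamp and the seed's first-seen month.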
