Hi everyone,

I’ve been building a Spanish historical web corpus collected from the Internet Archive (Wayback Machine) covering 2002–2023, and I wanted to share it with the community.

What makes it different: most Spanish corpora focus on news and Wikipedia. This one goes deeper into categories that are virtually non-existent elsewhere:

- Religion & Catholic traditions (Semana Santa, pilgrimages, cofradías)
- Folklore & regional legends (Galician meigas, Basque basajaun, Celtic myths)
- Esotericism & mysticism (astrology, tarot, occult Spanish web)
- Conspiracies & pseudoscience (critical for misinformation detection)
- BOE legal texts: formal administrative Spanish since 2004
- Oposiciones exam materials: formal academic Spanish
- Regional news from all 17 autonomous communities
- Forums & colloquial Spanish (2003–2022)

All records include automatic labeling:

- topics, region, sentiment + score
- linguistic era (web_1_0 → ia_era)
- quality score (0–100), readability, lexical density
- MD5 dedup hash

Format: JSONL (Hugging Face compatible, auto-converted to Parquet)

Available now: Pepere45 (Dang)

More datasets coming this week (Wikipedia ES, religion, folklore, esotericism).

Open to research collaborations, bulk licensing and custom extractions.

Contact: info@spanishcorpusai.tech | https://spanishcorpusai.tech

Happy to answer questions!
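
P.S. For anyone who wants to work with the JSONL directly, here is a minimal sketch of filtering records by quality score and dropping duplicates by MD5 hash. The key names `text`, `quality_score` and `md5` are assumptions for illustration only (the list above names the labels, not the exact field names), and the sample records are made up, so adjust both to the actual files.

```python
import hashlib
import json


def md5_of(text: str) -> str:
    """Return the hex MD5 digest of a UTF-8 encoded string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def filter_records(lines, min_quality=60):
    """Keep records at or above min_quality, one per MD5 hash.

    Assumed (hypothetical) keys: "text", "quality_score", "md5".
    If "md5" is missing, the hash is computed from "text".
    """
    seen = set()
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the JSONL file
        rec = json.loads(line)
        if rec.get("quality_score", 0) < min_quality:
            continue
        digest = rec.get("md5") or md5_of(rec.get("text", ""))
        if digest in seen:
            continue  # duplicate text already kept
        seen.add(digest)
        kept.append(rec)
    return kept


# Made-up sample lines standing in for a real JSONL file.
sample = [
    json.dumps({"text": "La procesión sale a las ocho.",
                "region": "Andalucía", "quality_score": 82}),
    json.dumps({"text": "La procesión sale a las ocho.",
                "region": "Andalucía", "quality_score": 82}),
    json.dumps({"text": "jajaja ok",
                "region": "Madrid", "quality_score": 35}),
]

kept = filter_records(sample)
print(len(kept), "record(s) kept")
```

With a real file, you would pass an open file handle instead of `sample`, since the function only needs an iterable of lines.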