Just launched: https://triskeldata.au
Structured, cleaned, and tokenized AI training datasets: no junk, no scraping, no bloat.
What’s Available (Full Access):
- Wikipedia – 26.1B tokens
- Reddit Comments – 13.0B tokens
- Reddit Submissions – 2.6B tokens
- PubMed – 5.8B tokens
- Project CodeNet – 6.1B tokens
- OpenAlex – 77.6B tokens
- Medical Journals – 354M tokens
- GeoNames (All Countries) – 835M tokens
Total: 132.389 billion tokens, all for under $200 USD
Why So Cheap?
- I’m covering hosting + processing costs only
- Datasets are cleaned, deduplicated, delivered as .jsonl, and ingestion-ready (see the loading sketch below)
- Built to support the dev community, not exploit it
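Since everything ships as .jsonl, ingestion is just reading one JSON object per line. Here's a minimal Python sketch of what that looks like; the file name and the "text" field are assumptions for illustration, since the post doesn't spell out the exact record schema.

```python
import json

def iter_records(path):
    """Stream records from a .jsonl file one line at a time,
    without loading the whole dataset into memory."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

if __name__ == "__main__":
    # Hypothetical example: walk a Wikipedia shard and count characters,
    # assuming each record has a "text" field (not confirmed by the post).
    total_chars = 0
    for record in iter_records("wikipedia.jsonl"):
        total_chars += len(record.get("text", ""))
    print(f"characters read: {total_chars}")
```

Streaming line by line like this is the usual way to feed .jsonl into a tokenizer or training pipeline without blowing up RAM on multi-billion-token files.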
License:
- Use for R&D, personal fine-tuning, private AI builds
- No resale, redistribution, or commercial deployment
Stop burning compute on messy junk.
Train on clean signal.