Pretraining on a corpus of raw human thoughts: feasibility and risks

Hi everyone.

Typically, most LLMs are trained on highly polished, structured, and edited text (books, articles, curated content).

What would happen if an LLM were pretrained on a large corpus of “raw human thought”: brainstorming notes, half-formed ideas, hesitations, stream-of-consciousness writing, and unedited drafts, before undergoing fine-tuning or specialization?


The problem lies with the dataset—specifically its size and quality. Do you have access to trillions of tokens of ‘raw human thought’ for training?


My idea is to recruit volunteers from AI communities, both users and devs.
They could chat with an LLM that is prompted at the start of each session with:
“Please answer in 1 or 2 short sentences maximum. Just briefly react, ask questions, or suggest alternative paths, without giving long or structured explanations.”
That way, it should be possible to collect lots of data.
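A minimal sketch of how such volunteer contributions could be logged. The JSONL format, field names, and `record_thought` helper are illustrative assumptions, not an existing tool; only the short-reply prompt comes from the post above:

```python
import io
import json
import time

# The short-reply system prompt from the post: it keeps the assistant terse
# so volunteers keep typing their own raw thoughts instead of reading essays.
SHORT_REPLY_PROMPT = (
    "Please answer in 1 or 2 short sentences maximum. Just briefly react, "
    "ask questions, or suggest alternative paths, without giving long or "
    "structured explanations."
)

def record_thought(sink, user_id, text):
    """Append one raw, unedited contribution as a JSONL line (hypothetical format)."""
    entry = {"user": user_id, "ts": time.time(), "text": text}
    sink.write(json.dumps(entry) + "\n")
    return entry

# Usage: collect into an in-memory buffer (a real deployment would append to a file).
buf = io.StringIO()
record_thought(buf, "vol-001", "hmm what if the model just... no wait, scrap that")
record_thought(buf, "vol-001", "ok so maybe stage 2 needs timestamps per fragment")
print(len(buf.getvalue().splitlines()))  # 2 contributions stored
```

Keeping the contributions deliberately unedited is the point: the value of the corpus is in the hesitations and dead ends, not in cleaned-up prose.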


Alternatively, we could wait until something like Neuralink becomes part of everyday life.


Lol. That could open vast new ways of thinking, a real meta-brain. Life would find a way, just as we needed time to learn to use the power of multiprocessor and multicore systems.

Back to something less science-fictional: let me humbly continue presenting my layman's idea.

I’m exploring a multi-stage training architecture for LLMs, aiming to better align them with human cognitive patterns.

The idea is to move away from the standard path of training on highly structured, curated data from the start. Instead, the process would involve three distinct stages:

  1. Minimal Pretraining — A small, clean corpus (basic texts, dialogues, Simple English Wikipedia) to give the model basic syntax and vocabulary without overfitting on highly formal data.

  2. Core Training on Raw Human Thought — A medium-sized corpus built from volunteer-contributed notes, unedited reflections, fragmented ideas, doubts, contradictions, and partial reasoning. This step would expose the model to more realistic, non-linear human thinking patterns.

  3. Fine-Tuning on Specialized Knowledge Domains — Traditional fine-tuning on structured, expert-level content (law, science, etc.) to provide accuracy and domain grounding after the model has learned to “think like a human”.
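The three stages above could be sketched as a simple sequential curriculum. All corpus names, token budgets, and learning rates below are placeholder assumptions, since nothing in the thread specifies sizes:

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str          # human-readable stage label
    corpus: str        # identifier of the dataset to stream
    token_budget: int  # how many tokens to train on in this stage
    lr: float          # peak learning rate for the stage

# Illustrative numbers only: placeholder budgets, corpus names, and rates.
CURRICULUM = [
    Stage("minimal_pretrain", "clean_basic_text", 5_000_000_000, 3e-4),
    Stage("raw_thought_core", "volunteer_raw_thought", 20_000_000_000, 1e-4),
    Stage("domain_finetune", "expert_structured", 2_000_000_000, 2e-5),
]

def run_curriculum(stages, train_fn):
    """Run each stage in order, threading the model state forward."""
    state = None
    for stage in stages:
        state = train_fn(state, stage)
    return state

# Dry run with a stub trainer that just returns the stage name as "state".
final = run_curriculum(CURRICULUM, lambda state, stage: stage.name)
print(final)  # domain_finetune
```

The interesting design choice is the middle stage's relative size: making the raw-thought corpus the largest stage is what distinguishes this from ordinary fine-tuning on messy data.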

The aim is to develop a model that is both cognitively aligned (more flexible, exploratory) and still capable of handling structured, factual domains.

Challenges include catastrophic forgetting, corpus construction, and balancing expressive capacity with factual reliability.
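On the catastrophic-forgetting point, one standard mitigation is to replay a fraction of earlier-stage data during later training. A minimal sketch, where the 20% replay ratio and the `mixed_batches` helper are illustrative guesses rather than a prescribed recipe:

```python
import random

def mixed_batches(current_stream, replay_pool, replay_frac=0.2, seed=0):
    """Yield training examples, replacing roughly replay_frac of them with
    samples replayed from earlier stages. Replay is a common mitigation for
    catastrophic forgetting; the 20% default is an illustrative guess."""
    rng = random.Random(seed)
    for example in current_stream:
        if replay_pool and rng.random() < replay_frac:
            yield rng.choice(replay_pool)
        else:
            yield example

# Usage: stage-3 expert examples interleaved with stage-2 raw-thought replay.
stage3 = [f"law_doc_{i}" for i in range(100)]
stage2_pool = [f"raw_thought_{i}" for i in range(10)]
out = list(mixed_batches(stage3, stage2_pool))
print(len(out))  # 100: the stream length is preserved, only its mix changes
```

Tuning `replay_frac` is exactly the "balancing expressive capacity with factual reliability" trade-off: too much replay dilutes the domain fine-tuning, too little lets the raw-thought behavior wash out.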

I'd be happy to read rough criticism from any readers; that's how we learn.
