Pretraining on a corpus of raw human thoughts: feasibility and risks

Hi everyone.

Typically, most LLMs are trained on highly polished, structured, and edited text (books, articles, curated content).

What would happen if an LLM were pretrained on a large corpus of “raw human thought”: brainstorming notes, half-formed ideas, hesitations, stream-of-consciousness writing, and unedited drafts, before undergoing fine-tuning or specialization?


The problem lies with the dataset—specifically its size and quality. Do you have access to trillions of tokens of ‘raw human thought’ for training?


My idea is to recruit volunteers from AI communities, both users and devs.
They could chat with an LLM that is prompted at the start of each session with:
“Please answer in 1 or 2 short sentences maximum. Just briefly react, ask questions, or suggest alternative paths, without giving long or structured explanations.”
That way, it should be possible to collect lots of data.
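A minimal sketch of how such volunteer contributions could be logged. The JSONL format, field names, and `record_thought` helper are illustrative assumptions, not an existing tool; only the short-reply prompt comes from the post above:

```python
import io
import json
import time

# The short-reply system prompt from the post: it keeps the assistant terse
# so volunteers keep typing their own raw thoughts instead of reading essays.
SHORT_REPLY_PROMPT = (
    "Please answer in 1 or 2 short sentences maximum. Just briefly react, "
    "ask questions, or suggest alternative paths, without giving long or "
    "structured explanations."
)

def record_thought(sink, user_id, text):
    """Append one raw, unedited contribution as a JSONL line (hypothetical format)."""
    entry = {"user": user_id, "ts": time.time(), "text": text}
    sink.write(json.dumps(entry) + "\n")
    return entry

# Usage: collect into an in-memory buffer (a real deployment would append to a file).
buf = io.StringIO()
record_thought(buf, "vol-001", "hmm what if the model just... no wait, scrap that")
record_thought(buf, "vol-001", "ok so maybe stage 2 needs timestamps per fragment")
print(len(buf.getvalue().splitlines()))  # 2 contributions stored
```

Keeping the contributions deliberately unedited is the point: the value of the corpus is in the hesitations and dead ends, not in cleaned-up prose.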


Alternatively, we could wait until something like Neuralink becomes part of everyday life.


Lol. That could open vast new ways of thinking, a real meta-brain. Life would find a way, just as we needed time to learn to use the power of multiprocessor and multicore systems.

Back to something less science-fictional: let me humbly continue presenting my layman's idea.

I’m exploring a multi-stage training architecture for LLMs, aiming to better align them with human cognitive patterns.

The idea is to move away from the standard path of training on highly structured, curated data from the start. Instead, the process would involve three distinct stages:

  1. Minimal Pretraining — A small, clean corpus (basic texts, dialogues, Simple English Wikipedia) to give the model basic syntax and vocabulary without overfitting on highly formal data.

  2. Core Training on Raw Human Thought — A medium-sized corpus built from volunteer-contributed notes, unedited reflections, fragmented ideas, doubts, contradictions, and partial reasoning. This step would expose the model to more realistic, non-linear human thinking patterns.

  3. Fine-Tuning on Specialized Knowledge Domains — Traditional fine-tuning on structured, expert-level content (law, science, etc.) to provide accuracy and domain grounding after the model has learned to “think like a human”.
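The three stages above could be sketched as a simple sequential curriculum. All corpus names, token budgets, and learning rates below are placeholder assumptions, since nothing in the thread specifies sizes:

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str          # human-readable stage label
    corpus: str        # identifier of the dataset to stream
    token_budget: int  # how many tokens to train on in this stage
    lr: float          # peak learning rate for the stage

# Illustrative numbers only: placeholder budgets, corpus names, and rates.
CURRICULUM = [
    Stage("minimal_pretrain", "clean_basic_text", 5_000_000_000, 3e-4),
    Stage("raw_thought_core", "volunteer_raw_thought", 20_000_000_000, 1e-4),
    Stage("domain_finetune", "expert_structured", 2_000_000_000, 2e-5),
]

def run_curriculum(stages, train_fn):
    """Run each stage in order, threading the model state forward."""
    state = None
    for stage in stages:
        state = train_fn(state, stage)
    return state

# Dry run with a stub trainer that just returns the stage name as "state".
final = run_curriculum(CURRICULUM, lambda state, stage: stage.name)
print(final)  # domain_finetune
```

The interesting design choice is the middle stage's relative size: making the raw-thought corpus the largest stage is what distinguishes this from ordinary fine-tuning on messy data.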

The aim is to develop a model that is both cognitively aligned (more flexible, exploratory) and still capable of handling structured, factual domains.

Challenges include catastrophic forgetting, corpus construction, and balancing expressive capacity with factual reliability.
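On the catastrophic-forgetting point, one standard mitigation is to replay a fraction of earlier-stage data during later training. A minimal sketch, where the 20% replay ratio and the `mixed_batches` helper are illustrative guesses rather than a prescribed recipe:

```python
import random

def mixed_batches(current_stream, replay_pool, replay_frac=0.2, seed=0):
    """Yield training examples, replacing roughly replay_frac of them with
    samples replayed from earlier stages. Replay is a common mitigation for
    catastrophic forgetting; the 20% default is an illustrative guess."""
    rng = random.Random(seed)
    for example in current_stream:
        if replay_pool and rng.random() < replay_frac:
            yield rng.choice(replay_pool)
        else:
            yield example

# Usage: stage-3 expert examples interleaved with stage-2 raw-thought replay.
stage3 = [f"law_doc_{i}" for i in range(100)]
stage2_pool = [f"raw_thought_{i}" for i in range(10)]
out = list(mixed_batches(stage3, stage2_pool))
print(len(out))  # 100: the stream length is preserved, only its mix changes
```

Tuning `replay_frac` is exactly the "balancing expressive capacity with factual reliability" trade-off: too much replay dilutes the domain fine-tuning, too little lets the raw-thought behavior wash out.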

I'd be happy to read rough criticism from any readers; that's how we learn.
