Preprocessing WhatsApp for style cloning: How to handle session gaps and multi-message blocks?

I am preparing a dataset to fine-tune a model on a specific chat style (Person Y) using WhatsApp exports. Most scripts pair messages 1:1, which loses context when one person sends multiple messages in a row.

I’m training on an RTX 3060 12GB. Here is the logic I’m using for the pipeline:

Phase 1: Grouping & Sessions

  • Block Merging: Consecutive messages from the same sender are merged into one block. (X X X → User block, Y Y → Assistant block).

  • 60-Minute Gap: If a reply takes over an hour, it starts a new session_id.

  • Session Pairing: To avoid “hallucinated context,” I only pair a User block with an Assistant block if they share the same Session ID. If Y replies days later, that pair is skipped.

  • Cleaning: Stripping invisible Unicode characters (\u200e), <Media omitted>, and URLs.

Phase 2: Chunking

  • Word Limit: 500 words per block.

  • Sentence Splitting: If a block is over 500 words, it splits at the nearest sentence boundary (.!?) so thoughts aren’t cut in half.

Questions:

  1. Is 60 minutes a good threshold for a “conversation break” in personal chats? Sometimes the gap runs past an hour even though the conversation clearly continues, and I’m not sure how to handle those cases.

  2. When merging messages, is it better to join them with a space or a newline (\n) for the model to learn the cadence?

  3. Should I filter out low-signal pairs like “Ok” → “K”, or does that help the model sound more natural?

  4. For Llama 3/Mistral, is there a preferred format for this kind of multi-message block data?

Looking for feedback on the logic before I start the training run.


Assuming you’re using HF’s Transformers, TRL’s SFTTrainer, or Unsloth’s trainer:


Bottom line

Your logic is substantially better than the usual naïve message[i] -> message[i+1] pipeline.

The two strongest parts are:

  • merging same-sender bursts into one turn, and
  • refusing to pair across long inactive gaps unless you believe the context is still real.

That matches both the way current chat fine-tuning tooling expects conversational data to look and what public WhatsApp-clone projects ended up discovering in practice. One Llama 3 WhatsApp-clone repo explicitly removed “strict alternation” validation because real message threads are often not perfectly user, assistant, user, assistant in raw exports. TRL’s SFT docs likewise assume a conversational dataset plus a chat template, not a flat pile of 1:1 text pairs. (GitHub)

The main change I would make before training is this:

Do not stop at single user-block → assistant-block pairs.
Build rolling multi-turn windows that end on a Person Y response. That gives the model both the target style and the conversational conditions under which that style appears. TRL is built to consume exactly that kind of structured conversational data. (Hugging Face)


1. Is 60 minutes a good threshold for a conversation break?

Yes, as a default. No, as a hard rule.

There is no universal “correct” inactivity threshold. Older work often used about 30 minutes, while a broader empirical study across multiple online activity domains argued that ~1 hour is a better general rule of thumb. Conversation-disentanglement research also supports the broader point that chronological adjacency is not the same thing as true conversational linkage. (arXiv)

For personal WhatsApp style cloning, I would treat 60 minutes as a soft boundary:

  • ≤ 60 min: same session by default
  • > 60 min: new session by default
  • 60–180 min: allow continuation only if there is strong evidence it is still the same thread

That evidence can be simple and cheap:

  • repeated names or entities,
  • obvious lexical overlap,
  • quote/reply continuation,
  • “about that / anyway / as I said” type continuation markers,
  • high embedding similarity between adjacent blocks.

Why this matters: in your use case, a false pair is more damaging than a skipped pair. If Y replies 9 hours later and you force that into a supervised X -> Y sample, you may teach the model a fake cause-and-effect relationship. Conversation disentanglement work exists precisely because message streams often contain hidden thread structure that simple adjacency misses. (arXiv)

What I would do in practice

Use a hybrid boundary rule, not time alone:

# gap_minutes: minutes since the previous block ended;
# semantic_overlap and SIM_THRESHOLD come from whatever continuity
# heuristic you use (lexical overlap or embedding similarity)
if gap_minutes <= 60:
    same_session = True
elif gap_minutes <= 180 and semantic_overlap >= SIM_THRESHOLD:
    same_session = True
else:
    same_session = False
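
If you do want the semantic check, here is a minimal sketch of producing that semantic_overlap score with sentence-transformers. The model name and the cosine-similarity-over-blocks approach are my assumptions, not part of your pipeline:

from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small and fast; any sentence encoder works

def block_similarity(block_a: str, block_b: str) -> float:
    # Cosine similarity between the two merged blocks, roughly in [-1, 1].
    emb = embedder.encode([block_a, block_b], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

# semantic_overlap = block_similarity(previous_block_text, current_block_text)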

If you do not want semantic similarity yet, keep the simple 60-minute rule for the first run. It is a reasonable conservative default.

What about replies after many hours that are obviously still about the same thing?

Do not force them into ordinary paired chat SFT unless you can verify continuity. Either:

  • keep them only when your continuity heuristic says they belong together, or
  • store them separately as Y-only style text for optional later style adaptation.

That way you do not waste useful text, but you also do not contaminate the conversational supervision.


2. Space or newline when merging messages?

Use newline (\n) by default.

The reason is not that newline has mystical training power. The reason is that it preserves the micro-cadence of texting.

These two are not the same style signal:

where are you
2 mins
coming

versus

where are you 2 mins coming

Chat models are still just token predictors over a formatted sequence, and the model-specific chat template handles the outer conversation structure. Inside a same-sender block, though, you decide whether the burst structure survives preprocessing. Different chat models also use different control tokens and different chat templates, so preserving the inner cadence yourself is valuable. (Hugging Face)

My recommendation

  • join same-sender consecutive messages with single newline
  • preserve double newline only if the original text clearly had paragraphing
  • avoid flattening with spaces unless the messages are truly sentence fragments that obviously belong together
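
As a sketch, assuming your parser yields per-message dicts with sender, text, and timestamp fields (the field names here are illustrative):

def merge_bursts(messages):
    # messages: list of dicts like {"sender": ..., "text": ..., "ts": ...},
    # already sorted by timestamp.
    blocks = []
    for msg in messages:
        if blocks and blocks[-1]["sender"] == msg["sender"]:
            # Same sender: keep the burst structure with a newline.
            blocks[-1]["text"] += "\n" + msg["text"]
            blocks[-1]["end_ts"] = msg["ts"]
        else:
            blocks.append({"sender": msg["sender"], "text": msg["text"],
                           "start_ts": msg["ts"], "end_ts": msg["ts"]})
    return blocks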

Why newline is better for style cloning

Because in messaging, style is often carried by:

  • stacked short sends,
  • one-line reactions,
  • delayed punchlines,
  • repeated questions,
  • pause/emphasis through line breaks,
  • “typing in bursts” behavior.

Those signals matter more for a WhatsApp clone than they would for ordinary instruction tuning.

Would I ever use a custom separator token?

Usually no. For a first run, plain newline is better than inventing a synthetic token such as <SAME_SENDER_BREAK>. Extra custom markers increase complexity and can become another thing the model overlearns.


3. Should you filter out low-signal pairs like Ok -> K?

Do not remove all of them. Downsample them.

Very short turns are part of authentic chat style. Public WhatsApp-style projects explicitly track short-message habits, emoji usage, slang, capitalization, punctuation, and response patterns because those small signals carry a lot of persona information. Other WhatsApp language-modeling projects also model message boundaries and message alternation as explicit structural cues rather than ignoring them. (GitHub)

But there is a real danger here: if your corpus contains too many generic low-information acknowledgments, the model can drift toward dull, under-informative replies.

The right rule

Keep short replies that are stylistically revealing, such as:

  • nahhh
  • wait
  • bro what
  • okayy
  • 😭
  • hmm
  • kk
  • weird punctuation or distinctive abbreviations

Downsample or cap short replies that are generic and repetitive, such as:

  • ok
  • k
  • yes
  • thanks
  • sure
  • repeated exact duplicates

A practical policy

Something like this works well:

  • if both sides are tiny and generic, keep only 10–25%
  • if the short reply contains distinctive spelling, emoji, punctuation, slang, or timing behavior, keep all
  • aggressively deduplicate exact repeated pairs

That preserves realism without letting the dataset collapse into “acknowledgment language.”
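
A rough sketch of that policy; the generic-word list, the two-word cutoff, and the 20% keep rate are illustrative numbers, not tuned values:

import random

GENERIC = {"ok", "k", "okay", "yes", "no", "thanks", "sure", "yeah"}

def keep_pair(user_block: str, assistant_block: str, keep_rate: float = 0.2) -> bool:
    u, a = user_block.strip().lower(), assistant_block.strip().lower()
    both_tiny = len(u.split()) <= 2 and len(a.split()) <= 2
    both_generic = u in GENERIC and a in GENERIC
    if both_tiny and both_generic:
        # Downsample generic acknowledgment pairs instead of dropping them all.
        return random.random() < keep_rate
    return True

Exact-duplicate pairs can be handled separately with a simple seen-set before this filter runs.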

One nuance many people miss

If Person Y often sends links as part of their style, I would not always strip URLs completely. I would often replace them with a placeholder such as:

<URL>

That keeps the stylistic fact that “this person sends links here” without forcing the model to memorize raw URLs.

Likewise for media placeholders, I would consider replacing <Media omitted> with a neutral token like <MEDIA> rather than deleting it outright if media-sharing behavior is part of the person’s style.
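
A minimal cleaning sketch along those lines (the regex and placeholder tokens are just examples):

import re

def clean_text(text: str) -> str:
    text = text.replace("\u200e", "")                   # strip invisible LTR mark
    text = text.replace("<Media omitted>", "<MEDIA>")   # keep the media-sharing behavior
    text = re.sub(r"https?://\S+", "<URL>", text)       # keep the "sends a link here" signal
    return text.strip()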


4. Preferred format for Llama 3 / Mistral?

Yes: use structured conversational data, not a hand-written universal prompt string.

For current Hugging Face-style SFT, the safest format is:

{
  "messages": [
    {"role": "user", "content": "merged X block"},
    {"role": "assistant", "content": "merged Y block"}
  ]
}

And for better context retention:

{
  "messages": [
    {"role": "user", "content": "X1"},
    {"role": "assistant", "content": "Y1"},
    {"role": "user", "content": "X2"},
    {"role": "assistant", "content": "Y2"}
  ]
}

This matches the current chat-templating model: chat models expect a list of role/content messages which the tokenizer converts into the model’s required token sequence. TRL’s SFTTrainer is designed to work with conversational datasets and a chat template. (Hugging Face)

Why not hardcode one prompt format?

Because Llama and Mistral do not share one universal raw prompt syntax.

  • Transformers’ chat-template docs explicitly say different chat models may use different control tokens and formats, even when derived from similar base models. (Hugging Face)
  • Mistral-Instruct v0.1 explicitly documents the [INST] ... [/INST] format and says that format is available via apply_chat_template(). (Hugging Face)
  • Llama 3 Instruct is positioned as a dialogue/assistant model, so it belongs in the same “structured messages + model-native template” workflow rather than a custom manually composed prompt format. (Hugging Face)

So the safest workflow is:

  1. store your dataset as messages
  2. load the tokenizer for your target model
  3. let the tokenizer / trainer apply the model’s own chat template

That is more robust than writing your own “one format for all models.”
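
As a sketch, using an example model name (any chat model that ships a template works the same way):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "user", "content": "where are you\n2 mins?"},
    {"role": "assistant", "content": "coming\n5 min"},
]

# Render with the model's own chat template and inspect the result before training.
rendered = tokenizer.apply_chat_template(messages, tokenize=False)
print(rendered)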


The biggest thing I would change in your pipeline

Move from isolated pairs to rolling multi-turn windows

Right now your pipeline sounds like it mainly produces:

User block -> Assistant block

That is good as a baseline, but I would upgrade it to:

[X1] -> [Y1]
[X1, Y1, X2] -> [Y2]
[X1, Y1, X2, Y2, X3] -> [Y3]

In other words: each training row should be a window of recent same-session turns ending with a Person Y message.

Why this matters:

  • style is context-sensitive,
  • short replies mean different things depending on prior turns,
  • sarcasm, warmth, bluntness, timing, and verbosity often appear only in context.

That is exactly the kind of structure conversational SFT is designed for. (Hugging Face)

A good first-window rule

Use the last 2–4 turns from the same session, ending on Y.

Not too short, because then the model loses conversational setup.
Not too long, because on your hardware long sequences are expensive.
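
A rough sketch of that windowing step, assuming the merged blocks already carry role and content fields:

def build_windows(blocks, max_turns=4):
    # blocks: alternating user/assistant blocks within one session,
    # each like {"role": "user" | "assistant", "content": ...}.
    examples = []
    for i, block in enumerate(blocks):
        if block["role"] != "assistant":
            continue
        # Take up to max_turns recent blocks, ending on this Person Y reply.
        window = blocks[max(0, i - (max_turns - 1)) : i + 1]
        examples.append({"messages": [
            {"role": b["role"], "content": b["content"]} for b in window
        ]})
    return examples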


Your 500-word chunking rule: good instinct, wrong unit

The instinct is right. The unit is not ideal.

For training, the real constraint is tokens after chat templating, not words. TRL’s SFT config is built around max_length, and sequences longer than that are truncated. The docs list max_length as the tokenized sequence cap and note that packing also uses it. (Hugging Face)

So I would change:

  • from 500 words per block
  • to token-based max length per training example

Better rule

  1. build the conversational messages
  2. apply the target tokenizer’s chat template
  3. count tokens
  4. trim to a token budget

For an RTX 3060 12GB, current Unsloth guidance says 7B/8B QLoRA is feasible in principle, but those VRAM figures are absolute minima and actual capacity depends heavily on batch size and sequence length. Their current table lists roughly 5 GB minimum for 7B QLoRA and 6 GB for 8B QLoRA, with a warning that higher batch sizes often cause OOM. (Unsloth)

What I would start with on your hardware

  • max_length = 1024 first
  • maybe 1536 if memory allows
  • batch size 1
  • gradient accumulation
  • QLoRA, not full 16-bit LoRA, for 7B/8B class models

That is much more realistic than thinking in 500-word blocks alone.
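
In TRL terms, that starting point might look roughly like this; the argument names follow the current SFTConfig docs but can shift between TRL versions, so treat it as a sketch:

from trl import SFTConfig

config = SFTConfig(
    output_dir="whatsapp-style-lora",
    max_length=1024,                   # token cap after chat templating
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    packing=False,                     # keep debugging simple for the first run
    assistant_only_loss=True,          # only if the chat template supports assistant masks (see below)
)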


How I would split long content

If a training example is too long:

  1. keep the final Y response intact
  2. keep the most recent context turns
  3. drop the oldest turns first
  4. only split inside a block as a last resort

That matters because chat SFT is about turn structure, not just sentence boundaries.

Sentence-based splitting is better than raw hard cuts, but for dialogue the most meaningful boundary is usually the turn, not the sentence. So:

  • first split by removing old turns,
  • then, only if necessary, split a very long block at a sentence boundary.
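
A minimal sketch of that trimming order, counting tokens through the target tokenizer’s chat template (the 1024 budget is just the example figure from above):

def trim_to_budget(messages, tokenizer, max_tokens=1024):
    # messages: list of {"role", "content"} dicts ending with the Person Y reply.
    def n_tokens(msgs):
        return len(tokenizer.apply_chat_template(msgs, tokenize=True))
    trimmed = list(messages)
    # Drop the oldest turns first; never drop the final assistant reply.
    # Some templates expect the first turn to be a user turn, so you may
    # need to drop turns in pairs instead of one at a time.
    while len(trimmed) > 1 and n_tokens(trimmed) > max_tokens:
        trimmed = trimmed[1:]
    return trimmed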

Assistant-only loss: good idea, but verify it

For your case, assistant-only loss is conceptually the right target: you want the model to learn Person Y’s reply behavior, not to imitate the user side.

TRL documents this directly: for conversational datasets you can set assistant_only_loss=True, and the loss is computed only on assistant responses. But the docs also warn that this only works for chat templates that support assistant token masks via {% generation %} and {% endgeneration %}. There are also recent real-world reports of Llama 3 SFT runs failing with the exact error “at least one example has no assistant tokens” when the template does not produce the assistant mask correctly. (Hugging Face)

What that means in practice

Before the full run, test a few samples and verify:

  • the rendered chat template looks correct,
  • assistant spans are present,
  • assistant masks are nonzero,
  • truncation did not chop away the assistant span.

If that check fails, do not trust the training run just because it launches.
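
One way to run that check, assuming a recent transformers release where apply_chat_template accepts return_assistant_tokens_mask (it only yields a useful mask when the template has {% generation %} blocks; the assistant_masks key name follows recent releases):

enc = tokenizer.apply_chat_template(
    example["messages"],
    tokenize=True,
    return_dict=True,
    return_assistant_tokens_mask=True,
)
assert sum(enc["assistant_masks"]) > 0, "no assistant tokens in this example"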


Packing: useful later, not for the first run

TRL supports packing and uses max_length for it. Packing can improve efficiency, but it also makes debugging harder, especially when you are already trying to validate masking and conversational structure. There have also been issue reports around packing and masking interactions. (Hugging Face)

For your first serious run, I would keep:

  • packing = False

Then only enable packing after:

  • the dataset format is stable,
  • assistant masking is verified,
  • and the baseline run behaves normally.

My recommended preprocessing policy for your exact use case

If I were preparing this dataset, I would use:

Phase 1: parse and clean

  • normalize timestamps and sender IDs
  • strip invisible control characters
  • replace raw URLs with <URL> instead of always deleting them
  • replace media placeholders with <MEDIA> if media behavior matters, otherwise drop them

Phase 2: merge bursts

  • merge consecutive same-sender messages into one block
  • join with \n, not spaces

Phase 3: sessionize

  • gap <= 60 min => same session
  • 60–180 min => same session only if continuity evidence is strong
  • > 180 min => new session

Phase 4: build examples

  • build rolling windows of the last 2–4 turns
  • only keep examples that end with a Person Y assistant turn
  • skip cross-session prompt→reply pairs

Phase 5: cap by tokens

  • render with the target tokenizer chat template
  • trim by token count, not word count
  • preserve the final Y response and nearest context first

Phase 6: filter

  • keep stylistically distinctive short replies
  • downsample generic short acknowledgments
  • deduplicate repeated tiny pairs aggressively

The format I would actually write to disk

For Llama 3 or Mistral, I would save JSONL like this:

{"messages":[
  {"role":"user","content":"where are you\n2 mins?"},
  {"role":"assistant","content":"coming\n5 min"},
  {"role":"user","content":"ok"},
  {"role":"assistant","content":"traffic 😭"}
]}

And not this:

{"prompt":"<custom handcrafted universal prompt>", "response":"..."}

The first form is easier to audit, easier to reuse across model families, and aligned with current Hugging Face conversational SFT expectations. (Hugging Face)
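
Writing that to disk is one JSON object per line; a minimal sketch:

import json

with open("whatsapp_sft.jsonl", "w", encoding="utf-8") as f:
    for example in examples:                      # each example: {"messages": [...]}
        f.write(json.dumps(example, ensure_ascii=False) + "\n")  # keep emoji readable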


Direct answers to your four questions

1. Is 60 minutes good?

Yes, as a default.
Treat it as a soft boundary, not an absolute one. Use 60 minutes by default, and allow rare exceptions when there is strong evidence of topic continuity. (arXiv)

2. Space or newline?

Newline.
It preserves texting cadence and burst structure better than space. (Hugging Face)

3. Filter low-signal pairs?

Do not remove all of them. Downsample the generic ones.
Keep short replies that reveal style; cap repetitive acknowledgments. Public WhatsApp-style projects explicitly treat short-message patterns, emoji, slang, punctuation, and response habits as real style signals. (GitHub)

4. Preferred format for Llama 3 / Mistral?

Yes: messages with model-native chat templating.
Do not invent one universal prompt string. Use conversational data plus the target tokenizer’s apply_chat_template() path. (Hugging Face)


Final recommendation

I would approve your pipeline with these changes:

  • keep block merging,
  • keep the 60-minute rule,
  • merge with newlines,
  • skip cross-session pairs,
  • switch from isolated pairs to rolling multi-turn windows,
  • cap by tokens, not words,
  • downsample generic tiny acknowledgments,
  • store everything as conversational messages,
  • and verify assistant-mask behavior before the real run.

That is the highest-probability path to a model that sounds like Person Y in context, rather than a model that only imitates Person Y’s surface wording.
