User embeddings based on sequence of events

Hi everyone. I work in ad-tech, and we’ve started investigating user embeddings using a Sequence-of-Events (SoE) approach - where embeddings are built not on aggregated features, but directly from raw user events.

We’ve already found a couple of promising papers, but it’s still hard for us to determine whether this approach will scale well to our use case (we handle hundreds of millions of users daily).

With this message, I would like to kindly ask anyone familiar with this topic to share suggestions - links to papers, web pages, approaches, relevant topics, GitHub repositories, anything. We want to build this the right way.

Thanks in advance.


For now, here are some resources.


Here is a “guide + reading list” for Sequence-of-Events (SoE) user embeddings in ad-tech, with enough context that you can use it as a starting spec inside your team.


1. What “user embeddings from sequences of events” really means

Traditional ad-tech user features:

  • You build aggregates like:

    • clicks_last_7d, imps_last_30d, sports_interest_score, #conversions_past_90d.
  • These are:

    • Orderless (you lose the fact that A happened before B).
    • Often time-insensitive (yesterday vs 6 months ago may look similar).
    • Labor-intensive to maintain for each objective and surface.

Sequence-of-Events (SoE) approach:

  • For each user you keep a time-ordered list of events:

    • Impressions, clicks, conversions, searches, page views, app actions…
  • Each event carries:

    • IDs (ad, item, campaign, publisher), context (device, geo, placement), time, etc.
  • You feed this sequence into a sequence model (RNN / Transformer).

  • The model outputs one or more user vectors = user embeddings.

  • These embeddings are then used for:

    • Candidate retrieval (ANN, similarity graph).
    • Ranking (CTR/CVR models).
    • Lookalikes / clustering / personalization in other products.

In other words: instead of hand-crafting summaries of the log, you let a sequence model learn how to compress the log into a vector.
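To make the input format concrete, here is a minimal sketch of what a per-user event log and its time-ordered SoE representation could look like. The schema (field names, event types) is a hypothetical example, not taken from any of the papers below:

```python
from dataclasses import dataclass

@dataclass
class Event:
    # Hypothetical schema: IDs, event type, context, and timing.
    item_id: int
    event_type: str   # "imp" | "click" | "conv"
    device: str
    timestamp: int    # unix seconds

def to_sequence(events):
    """Time-order a user's raw events; this ordered list is the SoE input."""
    return sorted(events, key=lambda e: e.timestamp)

raw = [
    Event(42, "click", "mobile", 1_700_000_300),
    Event(17, "imp", "mobile", 1_700_000_100),
    Event(42, "imp", "mobile", 1_700_000_200),
]
seq = to_sequence(raw)
print([e.item_id for e in seq])  # impressions come before the click they led to
```

The key difference from aggregated features is visible here: the model sees that the impression on item 42 preceded the click on item 42, which a `clicks_last_7d` counter would discard.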


2. Does it scale to hundreds of millions of users?

Short answer: yes. Several companies are doing almost exactly what you describe at “hundreds of millions – billions of users per day” scale. The trick is how you architect it.

2.1 Meta: ALURE – async user embeddings for ads

Paper: Async Learned User Embeddings for Ads Delivery Optimization (ALURE).(arXiv)

What they do:

  • Learn user embeddings from sequence-based, multimodal user activities using a Transformer-like model.(arXiv)
  • Do this asynchronously for billions of users per day.
  • Build a user similarity graph from these embeddings and use it to retrieve ad candidates, combined with realtime signals in the main ads system.(arXiv)

Why this matters for you:

  • This is almost exactly “user embeddings based on SoE events for ads,” proven at Meta scale.
  • They explicitly decouple heavy sequence modeling into an offline/nearline pipeline; serving only uses precomputed embeddings + realtime features.

2.2 Alibaba / Taobao: long sequential user behavior for CTR

Core paper: Practice on Long Sequential User Behavior Modeling for CTR Prediction (MIMN + UIC).(arXiv)

What they say:

  • Long user sequences are important, but:

    “system latency and storage cost increase approximately linearly with the length of user behavior sequence.”(arXiv)

  • They propose:

    • MIMN (Multi-channel user Interest Memory Network) to summarize long histories.
    • UIC (User Interest Center), a separate service that stores user interest vectors for each user.

Why this matters:

  • UIC is effectively a user embedding service built from sequences.
  • It’s deployed in Alibaba’s display ads system and handles sequences up to thousands of events per user.(arXiv)

Follow-up work:

  • SIM (Search-based User Interest Modeling) and ETA-Net improve how they search and attend over lifelong user histories (tens of thousands of behaviors) while staying within latency budgets.(arXiv)

Takeaway: they solved scaling issues by:

  • Separating interest modeling from CTR serving (UIC).
  • Using retrieval + efficient attention for very long histories rather than running a giant Transformer over everything at serve time.

2.3 Pinterest: TransAct – realtime + batch user sequence modeling

Paper: TransAct: Transformer-based Realtime User Action Model for Recommendation at Pinterest.(arXiv)

Key design:

  • A realtime Transformer (TransAct) that encodes recent user actions.
  • Combined with batch-generated user embeddings that summarize long-term preferences.(arXiv)
  • Deployed to multiple large surfaces (Homefeed, Related Pins, Notifications, Search).(arXiv)
  • Public PyTorch repo: pinterest/transformer_user_action.(GitHub)

Why it matters:

  • Shows a concrete “hybrid” pattern:

    • Offline SoE embeddings for long-term behaviour.
    • Realtime sequence encoder for short-term intent.
  • They discuss practical things like time-window masking to avoid label leakage and match online conditions.(arXiv)

This is very close to what you’d do if you plug SoE user embeddings into an existing ad ranking stack.


2.4 Kuaishou: TWIN – lifelong user sequences

Paper: TWIN: TWo-stage Interest Network for Lifelong User Behavior Modeling in CTR Prediction at Kuaishou (KDD 2023).(arXiv)

Highlights:

  • Targets lifelong histories (behaviors over months/years; sequences of length 10^4–10^5).(arXiv)

  • Uses two stages:

    • A General Search Unit (GSU) over the long history.
    • An Exact Search Unit (ESU) with attention over a small relevant subset.
  • Ensures consistent relevance metrics between GSU and ESU so the retrieval stage doesn’t filter out behaviors the attention stage cares about.(arXiv)

Takeaway: if you eventually want very long histories, you probably need a two-stage design like TWIN/SIM, not a single giant sequence model.


2.5 Tencent: AETN – general-purpose user embeddings from app usage

Paper: General-Purpose User Embeddings based on Mobile App Usage (Tencent, KDD 2020).(arXiv)

They:

  • Model sequences of app events: install, uninstall, retention, etc. (heterogeneous events).(arXiv)
  • Use an AutoEncoder-coupled Transformer Network (AETN) to learn general-purpose user embeddings.(arXiv)
  • Deploy these embeddings in multiple downstream applications (ads, recommendations, etc.) at Tencent scale.(arXiv)

Takeaway: SoE embeddings can be shared across many tasks and teams, not just one CTR model.


2.6 Survey confirmation

Survey: “A Survey on User Behavior Modeling in Recommender Systems” (IJCAI 2023).(arXiv)

  • Defines categories like Long-Sequence UBM and User-Behavior Retrieval-based methods.
  • Explicitly discusses industrial systems like MIMN/UIC, SIM, etc. as examples of long-sequence user modeling at scale.(IJCAI)

This gives you a good overview of where SoE user embeddings fit in the broader recommender literature.


3. What all these systems have in common

If you strip away the details, the successful large-scale systems share a few core ideas.

3.1 Heavy sequence modeling is not in the hot path

Instead of recomputing a big Transformer for every ad request:

  • Meta ALURE:

    • Runs a Transformer-like model offline/nearline on user histories.
    • Produces embeddings asynchronously for billions of users per day.
    • Those embeddings are used later in retrieval + ranking.(arXiv)
  • Alibaba MIMN/UIC:

    • UIC is a separate module that stores user interest vectors produced from long sequences.
    • The main CTR model queries UIC; it doesn’t redo long-sequence modeling at request time.(arXiv)

Pattern:

Build a user embedding service that is updated offline/nearline, then reuse its outputs everywhere.

For you: this is how you get SoE richness without blowing up latency.


3.2 Long histories are managed with windows or two-stage retrieval

Naïve idea: “just feed all 10,000 events into a big Transformer.”
Reality: too slow and too costly at ad serving QPS.

What people actually do:

  • Use a recent window (e.g., last 100–300 events) for the main encoder.

  • For lifelong histories, use two-stage methods:

    • SIM and TWIN: fast search over the full history, then attention over a small subset.(arXiv)

This keeps complexity manageable while still benefiting from long-term behavior.
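A toy numpy sketch of the two-stage idea, assuming the simplest possible "general search" (hard category match, as in SIM's hard-search variant) followed by softmax attention over the retrieved subset. All sizes and the category-match criterion are illustrative, not the papers' actual implementations:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8

# Hypothetical lifelong history: one (vector, category) pair per behavior.
history_vecs = rng.normal(size=(10_000, D))
history_cats = rng.integers(0, 50, size=10_000)

def gsu(target_cat, k=200):
    """General Search Unit: cheap hard search over the full history,
    here just a category match, keeping at most k behaviors."""
    idx = np.flatnonzero(history_cats == target_cat)[:k]
    return history_vecs[idx]

def esu(target_vec, subset):
    """Exact Search Unit: softmax attention over the small retrieved subset."""
    scores = subset @ target_vec
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ subset  # attention-pooled interest vector

target = rng.normal(size=D)
subset = gsu(target_cat=7)
interest = esu(target, subset)
print(len(subset), interest.shape)
```

The point of the split is cost: the GSU touches all 10,000 behaviors but does only trivial work per behavior, while the expensive attention runs over a few hundred at most.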


3.3 Hybrid offline (long-term) + realtime (short-term)

TransAct is the cleanest example:

  • Batch user embeddings (long-term) + realtime Transformer features (short-term).(arXiv)

ALURE also combines async user embeddings with realtime user activity when retrieving ads.(arXiv)

Pattern:

  • Offline part:

    • SoE encoder over long window, updated every X minutes/hours.
  • Realtime part:

    • Small model or features over last few events in the current session.

For your scale, I would assume this hybrid structure from the beginning.


3.4 Specialized infra for big embedding tables and sequences

Two libraries you’ll see referenced:

  • Transformers4Rec (NVIDIA Merlin):

    • Open-source library for sequential and session-based recommendation using Transformers.(arXiv)
    • Integrates with NVTabular (preprocessing) and Triton (inference) to build GPU-accelerated pipelines end-to-end.(ACM Digital Library)
  • TorchRec (Meta):

    • PyTorch domain library for large-scale recommendation, with primitives for sharded embedding tables and distributed training/inference.(IJCAI)

At hundreds of millions of users, the bottleneck is often the embedding infrastructure, not the sequence model itself. Using one of these stacks (or building something similar) is strongly recommended.
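As a rough intuition for why the infra matters, here is a toy sketch of hash-sharding an embedding table across workers, which is the basic pattern libraries like TorchRec industrialize (their real sharding planners support many more strategies). Everything here is illustrative:

```python
import numpy as np

N_SHARDS = 4   # stand-in for GPUs/hosts
D = 16

# Toy sharded embedding table: user i's row lives on shard i % N_SHARDS.
shards = [dict() for _ in range(N_SHARDS)]

def lookup(user_id):
    """Route the lookup to the owning shard; lazily init rows for the sketch."""
    shard = shards[user_id % N_SHARDS]
    if user_id not in shard:
        shard[user_id] = np.zeros(D)
    return shard[user_id]

vec = lookup(123_456_789)
print(vec.shape)  # (16,)
```

At hundreds of millions of users, no single machine holds the full table, so this routing (plus all-to-all communication during training) is exactly the part you do not want to build from scratch.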


4. A practical blueprint for your ad-tech use case

Below is a simplified but realistic step-by-step plan.

4.1 Step 1 – Start small and narrow

Pick:

  • 1–2 high-impact ad surfaces (e.g., feed ads on web + app).
  • 1 main objective (CTR or CVR).

Build a dataset:

  • For each user, collect the last 100–200 events (impressions, clicks, key site actions).

  • For each event, include:

    • Item/ad ID, campaign ID, advertiser ID.
    • Event type (imp/click/conv).
    • Basic context (device, country, placement, page type).
    • Time info (timestamp bucket + time since previous event).

This gives you a clean SoE representation to experiment with.
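The dataset-building step above can be sketched in a few lines. The log tuple layout and field names are hypothetical placeholders for whatever your event log actually contains:

```python
from collections import defaultdict

N_EVENTS = 200  # keep only the most recent window per user

def build_sequences(log):
    """log: iterable of (user_id, ts, item_id, event_type) tuples.
    Returns user_id -> time-ordered list of event dicts, truncated to the
    last N_EVENTS, with a time-since-previous-event feature."""
    per_user = defaultdict(list)
    for user_id, ts, item_id, event_type in log:
        per_user[user_id].append((ts, item_id, event_type))
    out = {}
    for user_id, events in per_user.items():
        events.sort()                   # time order
        events = events[-N_EVENTS:]     # recent window only
        seq, prev_ts = [], None
        for ts, item_id, event_type in events:
            seq.append({
                "item_id": item_id,
                "event_type": event_type,
                "dt": 0 if prev_ts is None else ts - prev_ts,
            })
            prev_ts = ts
        out[user_id] = seq
    return out

log = [(1, 100, 7, "imp"), (1, 160, 7, "click"), (2, 50, 9, "imp")]
seqs = build_sequences(log)
print(seqs[1][1]["dt"])  # 60 seconds between the impression and the click
```

In production this would be a Spark/Flink-style job rather than an in-memory dict, but the shape of the output (per-user ordered sequences with relative-time features) is the same.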


4.2 Step 2 – Train a modest SoE encoder

Choose a simple model first:

  • A 2–3 layer Transformer or GRU with:

    • Embedding dim ~64–128.
    • Input length 100–200 events.

Train it to:

  • Predict click/no-click (or next event) given the history up to time t.

From this model, define the user embedding as:

  • The final hidden state, or
  • An attention-pooled summary over the sequence.

Run offline comparisons:

  • Old model (aggregated features) vs new model (aggregated + user embedding).
  • Look at AUC / log-loss / NDCG improvements.

Goal: prove value offline and debug data/leakage issues.
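The "attention-pooled summary" option above can be sketched independently of the encoder. Assuming you already have per-event hidden states from any GRU/Transformer (represented here by random numpy arrays), the pooling itself is just a learned query attending over the sequence:

```python
import numpy as np

def attention_pool(hidden, query):
    """hidden: (T, d) encoder outputs over the event sequence;
    query: (d,) learned query vector. Returns a (d,) user embedding."""
    scores = hidden @ query / np.sqrt(hidden.shape[1])
    w = np.exp(scores - scores.max())   # numerically stable softmax
    w /= w.sum()
    return w @ hidden

rng = np.random.default_rng(1)
hidden = rng.normal(size=(150, 64))    # stand-in for GRU/Transformer outputs
query = rng.normal(size=64)

user_emb = attention_pool(hidden, query)   # attention-pooled summary
final_state = hidden[-1]                   # the "final hidden state" option
print(user_emb.shape, final_state.shape)   # (64,) (64,)
```

In practice the pooled variant is often more robust than the final hidden state, because a single noisy last event cannot dominate the embedding.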


4.3 Step 3 – Turn it into a user embedding service

Once you have a good encoder:

  • Run it in a batch/nearline job:

    • Every X minutes/hours, update user_id → embedding for active users.
  • Store embeddings in a sharded key-value store (or whatever storage you already use for features).

Then update your online stack:

  • On each ad request:

    • Look up user embedding.
    • Feed it, together with existing features, into your CTR/CVR ranker.
  • Optionally:

    • Start using it as a query vector in an ANN index to retrieve candidate ads/items.

At this point, you have a real SoE-based user representation in production.
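The service split above (nearline refresh, lookup-only hot path) can be sketched with a dict standing in for the sharded KV store. The encoder and ranker here are dummies; the point is the control flow, not the models:

```python
import numpy as np

# Toy stand-in for a sharded key-value store: user_id -> embedding.
kv_store = {}

def nearline_refresh(encoder, active_user_sequences):
    """Batch/nearline job: re-encode active users every X minutes/hours."""
    for user_id, seq in active_user_sequences.items():
        kv_store[user_id] = encoder(seq)

def serve_request(user_id, ad_features, ranker):
    """Hot path: one KV lookup plus the existing ranker, no sequence model."""
    user_emb = kv_store.get(user_id, np.zeros(8))  # cold-start fallback
    return ranker(np.concatenate([user_emb, ad_features]))

encoder = lambda seq: np.full(8, len(seq), dtype=float)  # dummy SoE encoder
ranker = lambda x: float(x.sum())                        # dummy CTR ranker

nearline_refresh(encoder, {1: ["imp", "click"]})
score = serve_request(1, np.ones(4), ranker)
print(score)  # 8 * 2.0 + 4 * 1.0 = 20.0
```

Note the cold-start fallback: users whose embedding has not been computed yet get a zero vector (or a population prior), so serving never blocks on the sequence model.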


4.4 Step 4 – Add a small realtime head

When the basics are stable:

  • Add a compact realtime sequence model over the last few events in the current session (e.g., last 10–20 events).

    • This can be a tiny Transformer or GRU.
  • Have the ranker take:

    • Long-term user embedding (from batch job).
    • Short-term session embedding (from realtime head).
    • Ad/item features + context.

This is effectively a simplified TransAct-style hybrid design.(arXiv)
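The ranker's input at this stage is just the concatenation of the three parts listed above. A minimal sketch with a logistic scoring head (dimensions and weights are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)

def hybrid_score(long_emb, session_emb, ad_feats, w, b):
    """Logistic CTR score over the concatenated hybrid representation."""
    x = np.concatenate([long_emb, session_emb, ad_feats])
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

long_emb = rng.normal(size=64)     # from the batch user embedding job
session_emb = rng.normal(size=32)  # from the realtime head (last ~10-20 events)
ad_feats = rng.normal(size=16)     # ad/item + context features
w = rng.normal(size=64 + 32 + 16)

p = hybrid_score(long_emb, session_emb, ad_feats, w, b=0.0)
print(0.0 < p < 1.0)  # True
```

In a real system the scoring head would be your existing deep CTR model; the change is only that two extra vector inputs are appended to its feature set.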


4.5 Step 5 – Only then think about full lifelong histories

Once:

  • The SoE user embedding service works,
  • The hybrid ranker is stable and shows lift,

you can consider:

  • Extending sequence length,
  • Introducing two-stage retrieval (SIM/TWIN style) for lifelong histories.

This is where TWIN and SIM/ETA become relevant.(arXiv)

I would not start there for a first deployment.


5. Curated reading list and repos (short, opinionated)

If you want a “minimum set” of things to read and show colleagues:

5.1 Directly relevant industrial papers

  1. Meta – ALURE
    Async user embeddings from sequence-based activities for billions of users per day, powering a user similarity graph for ad retrieval.(arXiv)

  2. Alibaba – MIMN + UIC
    “Practice on Long Sequential User Behavior Modeling for CTR Prediction”: introduces MIMN and UIC, a separate interest center service to handle long sequences efficiently.(arXiv)

  3. Pinterest – TransAct
    “Transformer-based Realtime User Action Model for Recommendation”: hybrid of batch user embeddings + realtime Transformer, deployed to Homefeed and other surfaces.(arXiv)

  4. Kuaishou – TWIN
    “TWo-stage Interest Network for Lifelong User Behavior Modeling in CTR Prediction”: two-stage interest retrieval + attention for very long histories.(arXiv)

  5. Tencent – AETN
    “General-Purpose User Embeddings based on Mobile App Usage”: AutoEncoder + Transformer over app usage sequences, used across multiple downstream tasks.(arXiv)

  6. Survey – He et al. 2023
    “A Survey on User Behavior Modeling in Recommender Systems”: overview of conventional vs long-sequence vs retrieval-based user behavior models, including industrial systems.(arXiv)


5.2 Libraries and repos to look at

  • Transformers4Rec (NVIDIA Merlin) – GitHub
    Library for sequential & session-based recommendation with Transformers, integrated with NVTabular and Triton for full pipelines.(arXiv)

  • TorchRec (Meta) – GitHub + docs
    PyTorch library for large-scale recsys with sharded embedding tables and distributed training.(IJCAI)

  • pinterest/transformer_user_action – TransAct code
    Example of a production-style Transformer user action model.(GitHub)

These give you concrete templates for how to structure and run SoE models.


6. Very short summary

  • Yes, SoE user embeddings do scale: Meta (ALURE), Alibaba (MIMN/UIC, SIM, ETA), Pinterest (TransAct), Kuaishou (TWIN), Tencent (AETN) all run SoE-style user modeling at “hundreds of millions / billions of users per day” scale.(arXiv)

  • The key patterns they share:

    • Heavy sequence modeling is async/offline in a user embedding service.
    • Long histories are handled with windows or two-stage retrieval + attention, not one huge model.
    • They use hybrid offline (long-term) + realtime (short-term) representations.
    • They rely on specialized infra like TorchRec and Transformers4Rec for big embedding tables and sequence modeling.(ar5iv)
  • A sensible path for you:

    • Start with one surface + last 100–200 events + small Transformer/GRU.
    • Turn it into a user embedding service and plug into existing CTR/ranking.
    • Add a small realtime head.
    • Only then explore full lifelong histories and two-stage architectures.

This is an interesting use case, especially when user behavior is better represented as a sequence rather than a single aggregated vector. Treating events as ordered inputs and encoding them with a transformer or RNN-style approach can produce much richer user embeddings. A somewhat related discussion is here: https://discuss.huggingface.co/t/predict-next-embedding-given-sequence-of-embeddings/31845
