Conceptually similar work does exist.
Below is a focused, structured answer to your four questions, tailored to your concrete idea:
- A = you, L2 English (Mexican accent)
- B = “native” English version of you (e.g., ElevenLabs)
- C = B − A (accent delta)
- Then use A + k·C, B + k·C, Spanish + k·C, etc., inside a TTS/VC system such as XTTS-v2.
I will first answer the four questions concisely, then give more context, and finally a practical “how to prototype this” section.
0. Short answers to your 4 questions
- Has anyone tried “accent delta vector” subtraction/addition?
  - Not in exactly this form: no published work takes B − A from a single personal pair, but principal-direction editing of speaker embeddings and accent latents in FAC systems are very close (see section 2).
- Are accent and timbre separable enough in these spaces?
  - In standard speaker embeddings (x-vector, ECAPA, Resemblyzer, XTTS speaker encoder), accent is clearly encoded, but it is entangled with timbre, prosody, and channel. (Frontiers)
  - Specialized models (DART, Multi-Scale Accent Modeling) can disentangle accent and speaker reasonably well using extra structure (multi-level VAEs, vector quantization, adversarial objectives), but off-the-shelf embeddings will not give a perfectly clean “accent only” direction. (arXiv)
  - Practically: your C will likely change accent and some aspects of timbre/style.
- Existing pipelines that can handle embedding manipulation without collapsing?
  - XTTS-v2 explicitly uses an external speaker encoder and d-vectors to condition TTS; its design is compatible with feeding custom or pre-computed embeddings. (arXiv)
  - Coqui TTS tooling supports pre-computing d_vectors and passing them into models like VITS/XTTS, and their maintainers explicitly discuss computing and using these embeddings as separate files. (GitHub)
  - Research on artificial speaker embeddings (Lux et al.) shows how to move embeddings along discovered directions while staying on the manifold, which is directly relevant to “C = B − A; S + k·C”. (arXiv)
- Prior work on accent disentanglement / conversion that does something similar?
  - The entire field of Foreign Accent Conversion (FAC) is “accent → native → accent control”, usually via:
    - Discrete units + TTS: convert accent via self-supervised units and controllable accented TTS. (arXiv)
    - Multi-scale accent modeling: global + local accent latents, with disentangled speaker embeddings. (arXiv)
    - DART: explicit multi-level VAE that learns separate accent and speaker representations in multi-speaker TTS. (arXiv)
    - LLM-style frameworks like SpeechAccentLLM that unify FAC and TTS using discrete speech codes and separate accent conditioning. (arXiv)
  - So yes: the concept (accent latent, interpolation, “accent strength”) is common, but people usually don’t do it with one A/B pair; they train explicit accent encoders across many speakers.

In other words: your idea is very much aligned with current research, but you are proposing the most minimal version, using just speaker embeddings and one personal A/B pair.
1. What your C vector is really doing (conceptual background)
1.1 Speaker embeddings: what they encode
Speaker encoders (x-vector, ECAPA-TDNN, Resemblyzer, XTTS speaker encoder) are trained for speaker recognition:
Large reviews of deep speaker embeddings show they capture not just identity but also:
- gender, age,
- accent, language,
- speaking style, channel, and sometimes emotion. (Frontiers)
So if you embed:
- A = you, Mexican-accent English
- B = you, synthetic native English (ElevenLabs)
then C = B − A is not a “pure accent” vector. It encodes:
- accent changes,
- ElevenLabs’ prosody/style vs your own,
- any channel differences,
- plus normal noise in the embedding.
But if accent is a major systematic difference between A and B, then C will still have a strong accent component. This is why your idea is plausible.
1.2 Where accent is modeled in modern systems
Modern FAC and accented-TTS systems rarely rely purely on speaker embeddings for accent control. Instead they factor speech into three latent spaces:
1. Content / linguistic units: phonemes or self-supervised units (HuBERT/Wav2Vec discrete tokens) that encode “what is being said” with less speaker influence. (arXiv)
2. Speaker / timbre embeddings: equivalent to your speaker embedding; identity and global voice characteristics. (arXiv)
3. Accent / style latents: separate accent embeddings, global + local accent vectors, or VQ codes. (arXiv)
Your C lives in space (2) only, but real systems often control accent in (3), and sometimes also tweak the content representation (1). This mismatch is why C will be “accent + style” rather than perfectly clean accent.
2. Q1 – Who has done something close to “Accent + Delta Vector”?
No one (so far) has published exactly “take my L2 voice and my synthetic native clone, do B − A, then add C elsewhere”. But three strands of work are extremely similar.
2.1 Principal directions in speaker embedding space
Lux et al. (Interspeech 2023) train a generator of artificial speaker embeddings, then:
- Discover principal directions in embedding space that correspond to attributes (brightness, breathiness, age, etc.).
- Move embeddings along those directions to control these attributes in a TTS system. (arXiv)
Mathematically this is exactly what you want:
- C_attr is a direction;
- S’(k) = S + k·C_attr controls that attribute;
- They also show how to stay on the speaker manifold when doing this (important for stability).
Accent is just another attribute; there is nothing stopping you from learning an “accent direction” in the same way.
2.2 Editing speaker embeddings to change style / Lombard effect
The Interspeech 2025 program includes work on gradual modeling of the Lombard effect by modifying speaker embeddings in a TTS model. (ISCA Archive)
- They keep text and speaker ID fixed,
- Modify the speaker embedding by adding a “Lombard direction”,
- Then the model produces more Lombard-style speech (speaking in noise) for the same voice.
Again, same pattern: find a direction in embedding space that corresponds to a style and add it.
2.3 Accent embeddings and FAC systems
FAC and accented-TTS models like SpeechAccentLLM, Multi-Scale Accent Modeling, and DART explicitly introduce an accent latent:
- SpeechAccentLLM uses discrete speech codes and an LLM-style model that is conditioned on accent for both TTS and FAC. (arXiv)
- Multi-Scale Accent Modeling uses global and local accent embeddings, separate from speaker embeddings, and shows that you can vary accent while keeping speaker identity largely fixed. (arXiv)
- DART uses multi-level VAEs and vector quantization to explicitly disentangle accent and speaker in a multi-speaker TTS model, so you can pick any combination “speaker X + accent Y”. (arXiv)
They do not literally take B − A, but at the conceptual level they:
- Represent accent as a vector/latent;
- Interpolate or swap accent latents;
- Sometimes scale intensity (k) to control accent strength.
So your B − A = C idea is like a single-speaker, hand-crafted version of what these systems learn across large populations.
3. Q2 – How separable are accent and timbre in real embedding spaces?
3.1 Evidence for entanglement
Several lines of work show that standard speaker embeddings are heavily entangled:
- A large scoping review on deep learning speech systems notes that speaker embeddings (x-vectors, ECAPA) encode language, accent, and channel alongside identity, because the models optimize for speaker discrimination, not factorization. (Frontiers)
- Multi-accent ASR adaptation papers train x-vectors on mixed accent data; these embeddings are used as “accent vectors” to adapt ASR, but they also carry speaker information, which can hurt generalization if not handled carefully. (AudioCC Lab)
- A 2025 Interspeech paper on flow of speech for speaker recognition explicitly evaluates ECAPA-TDNN and Resemblyzer, showing that these embeddings are sensitive to speaking style and speech rate, not just identity. (ISCA Archive)
Conclusion: Accent is not orthogonal to timbre in these spaces. Changing accent will inevitably tug on other attributes.
3.2 Evidence that accent and speaker can be separated with structure
The fact that Multi-Scale Accent Modeling and DART can disentangle accent and speaker with explicit design is strong evidence that the information is separable in principle. But these models only achieve clean accent control because they force accent into one latent and speaker into another. That tells you:
- If you stay in plain ECAPA or XTTS speaker embedding space, a single C is at best a useful approximation to an accent direction, not a mathematically pure one.
For your purposes, that is acceptable; you are exploring “how linear is this space”, not building production-grade FAC.
4. Q3 – Pipelines that support your kind of embedding manipulation
You already identified the key components correctly. Here is how they fit together, with what we know from docs and papers.
4.1 XTTS-v2 + external speaker encoder
XTTS (and XTTS-v2) are multilingual zero-shot TTS models that:
- Use a pre-trained speaker encoder to extract a speaker embedding from a reference audio. (arXiv)
- Feed that embedding into the TTS backbone as conditioning, enabling cross-lingual voice cloning from a few seconds of audio. (arXiv)
The Coqui docs and related work highlight that you can:
- Pre-compute d_vectors with compute_embeddings.py and supply them as external speaker embeddings (d_vector_file), instead of always using the internal encoder on the fly. (GitHub)
This is exactly what you need:
- Compute A_i and B_i embeddings with the same encoder XTTS uses (ideal case).
- Form C = B_avg − A_avg.
- Construct S’(k) = normalize(S + k·C).
- Supply S’(k) as the d-vector to XTTS-v2.
Because the encoder is the one XTTS expects, you avoid dimension mismatch and keep edits relatively on-manifold.
4.2 Voice conversion frameworks with editable speaker embeddings
Pre-training Approaches for Voice Conversion (Unilight thesis) and related VC work show that:
- You can factor speech into content units + speaker embeddings + sometimes style.
- Speaker embeddings can be modified while leaving content units fixed, giving voice conversion and style control. (Unilight)
Non-autoregressive real-time accent conversion models with voice cloning also:
- Use a chain of ASR + content representation + TTS;
- Allow swapping accent while preserving identity. (arXiv)
If you plug your S + k·C into such a VC system’s speaker branch, you can test your idea in speech-to-speech mode rather than TTS mode.
4.3 Keeping embeddings from “collapsing” (stability)
Lux et al. explicitly study how to make embedding edits that:
- Change attributes gradually,
- Stay within the distribution of realistic speakers by generating embeddings from a learned model and moving along principal directions. (arXiv)
Your simpler approach can borrow some of their heuristics:
- Always renormalize the norm of S + k·C to match typical embeddings.
- Use small |k| (e.g., ≤0.5) to avoid out-of-distribution points.
- Optionally fit a small PCA around your natural embeddings and project edits back onto the top principal components.
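The three heuristics above can be sketched in a few lines of numpy/scikit-learn. `edit_embedding` is a hypothetical helper (not part of any toolkit), and the clamp range, component count, and norm target are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

def edit_embedding(S, C, k, reference_embeddings, n_components=32):
    """Apply S + k*C with simple on-manifold heuristics (illustrative sketch).

    S: base speaker embedding, shape (d,)
    C: unit-norm accent direction, shape (d,)
    reference_embeddings: (n, d) matrix of real embeddings, used to fit a
        local PCA so that edits stay near the speaker manifold.
    """
    k = float(np.clip(k, -0.5, 0.5))   # heuristic 2: keep |k| small
    edited = S + k * C

    # heuristic 3: project onto the top principal components of real embeddings
    pca = PCA(n_components=min(n_components, *reference_embeddings.shape))
    pca.fit(reference_embeddings)
    edited = pca.inverse_transform(pca.transform(edited[None, :]))[0]

    # heuristic 1: restore a typical embedding norm
    target_norm = np.linalg.norm(reference_embeddings, axis=1).mean()
    return edited * (target_norm / (np.linalg.norm(edited) + 1e-9))
```

The PCA projection is the crudest of the three; Lux et al. instead sample from a learned generator, but a local PCA around your own embeddings is a cheap stand-in.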
In summary: XTTS-v2 plus its speaker encoder, or a VC model with editable speaker embeddings, is a realistic pipeline for doing exactly what you propose without the model instantly collapsing.
5. Q4 – Prior work on accent disentanglement / conversion
You are essentially reinventing, in a personal way, what FAC and accent-TTS papers formalize. It is useful to see how they solve the same problem.
5.1 Discrete unit + TTS accent conversion
“Accent conversion using discrete units with parallel data synthesized from controllable accented TTS” (Nguyen et al. 2024) does the following: (arXiv)
- Cluster self-supervised discrete units from native speech to get accent-neutral content tokens.
- Use controllable accented TTS to generate synthetic parallel data.
- Train an accent conversion model that maps L2 → “native units” and then back to waveform while preserving speaker identity.
This is very similar to your aim (L2 → L1 keeping speaker) but works primarily in content + accent latent spaces, not just speaker space.
5.2 Multi-scale accent modeling and DART
Both show that accent is a robust latent that can be split from speaker identity if you design the model that way. Your C is an implicit version of that accent latent, baked into speaker space.
5.3 LLM-style unified frameworks: SpeechAccentLLM
SpeechAccentLLM (2025) proposes a unified framework for FAC and TTS using:
- SpeechCodeVAE to get discrete speech codes with CTC.
- An LLM-style FAC & TTS model conditioned on accent.
- The same model can perform FAC (L2→L1) or TTS with chosen accent. (arXiv)
Compared to your approach:
- They have a dedicated accent latent and plenty of data.
- You have one or a few speakers and a direct manipulation in speaker space, which is much lighter but rougher.
5.4 Older FAC / voice morphing work
Earlier work (voice morphing, foreign accent conversion by voice morphing, etc.) treats FAC as a special case of voice conversion and uses:
- Spectral mapping (GMMs, trajectory-ML, etc.) to morph L2 → L1 while preserving some speaker characteristics. (Semantic Scholar)
These provide metrics and listening test protocols that you can reuse to evaluate your own system (see next).
6. How to prototype your idea concretely (A, B, C, XTTS)
Here is a practical experimental plan that matches your mental picture and the current tools.
6.1 Data: build good A and B
- Record many English sentences (20–50+ distinct sentences) with your real accent.
- Generate corresponding native-accent versions with ElevenLabs or another TTS using your cloned voice and the same text.
- Keep conditions similar (sample rate, levels, trimming) so that the main systematic difference really is accent & style.
6.2 Use the same speaker encoder as XTTS
- Use the XTTS speaker encoder rather than a random ECAPA, so your embeddings are in exactly the space the TTS expects. (arXiv)
Embed each utterance:
- A_i for real accent.
- B_i for “native” version.
Compute:
- A_avg = mean_i(A_i)
- B_avg = mean_i(B_i)
- C_raw = B_avg − A_avg
- C = C_raw / ||C_raw||
Optionally, compute per-pair differences D_i = B_i − A_i and normalize each one before averaging (C = normalize(mean_i(D_i / ||D_i||))), which down-weights outlier pairs. Note that the plain (unnormalized) mean of the D_i is mathematically identical to B_avg − A_avg for paired data.
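A minimal numpy sketch of the arithmetic above; `accent_delta` is a hypothetical name. With `per_pair=True` each difference is normalized before averaging, which down-weights outlier pairs (the plain mean of the D_i equals B_avg − A_avg):

```python
import numpy as np

def accent_delta(A, B, per_pair=True):
    """Unit-norm accent direction C from paired utterance embeddings.

    A, B: (n, d) arrays; row i of B is the synthetic-native version of
    row i of A (same text, same encoder).
    """
    D = B - A                          # per-pair differences D_i = B_i - A_i
    if per_pair:
        # normalize each pair's difference so no single pair dominates
        D = D / (np.linalg.norm(D, axis=1, keepdims=True) + 1e-9)
    C_raw = D.mean(axis=0)             # without normalization: B_avg - A_avg
    return C_raw / np.linalg.norm(C_raw)
```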
6.3 Test if C really changes “accent” in embedding space
Before synthesis, test with simple classifiers:
- Collect some native American English and L2 English samples from public corpora (e.g., L2-ARCTIC, Westbrook English Accent Dataset). (Unilight)
- Embed them with the same encoder and train a small classifier f(x) that predicts “native vs non-native” or “American vs non-American”.
- For some of your A samples (L2 accent) compute:
  - x0 = embedding(A_sample).
  - x(k) = normalize(x0 + k·C) for k in a grid (e.g., k = 0, 0.25, 0.5, 0.75, 1.0).
- Evaluate f(x(k)).
If P(native) consistently increases with k, you have evidence that C aligns with a global accent direction, not just random noise.
You can even use f(x(k)) to calibrate k, e.g.:
- pick k_native so that P(native) ≈ 0.5–0.7;
- use k beyond that for “hyper-native” and below for “slightly accented”.
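This probe can be sketched end-to-end with scikit-learn. In a real run the two classes would be encoder embeddings of native vs L2 corpora; here they are synthetic Gaussians separated along a known axis, purely to demonstrate the mechanics of the f(x(k)) sweep:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64
accent_axis = rng.normal(size=d)
accent_axis /= np.linalg.norm(accent_axis)

# Stand-in data: in a real experiment these would be speaker-encoder
# embeddings of native vs L2 speech; here they are Gaussians separated
# along a known "accent" axis so the sweep has a ground truth.
native = rng.normal(size=(200, d)) + 2.0 * accent_axis
l2 = rng.normal(size=(200, d)) - 2.0 * accent_axis

X = np.vstack([native, l2])
y = np.array([1] * 200 + [0] * 200)          # 1 = native
clf = LogisticRegression(max_iter=1000).fit(X, y)

C = accent_axis        # in practice: C = normalize(B_avg - A_avg)
x0 = l2[0]             # one "L2-accented" embedding
ps = []
for k in [0.0, 0.25, 0.5, 0.75, 1.0]:
    x = x0 + 4.0 * k * C                             # step along the accent direction
    x = x / np.linalg.norm(x) * np.linalg.norm(x0)   # restore original norm
    p = clf.predict_proba(x[None, :])[0, 1]          # P(native)
    ps.append(p)
    print(f"k={k:.2f}  P(native)={p:.3f}")
```

If `ps` increases with k on your real embeddings the same way it does on this toy geometry, that is the evidence you want before touching the synthesizer.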
6.4 Feeding S + k·C to XTTS-v2
Once you are satisfied that C is doing something accent-like at the embedding level:
- Choose a base speaker embedding S:
  - S_English = A_avg (your real accent), or
  - S_Spanish = embedding of your Spanish recordings.
- For each k (e.g., −0.5, 0, 0.25, 0.5, 0.75):
  - S’(k) = S + k·C
  - S’(k) = S’(k) / ||S’(k)|| · ||S|| (restore original norm)
- Supply S’(k) to XTTS-v2 as the speaker embedding (d-vector).
- Synthesize:
  - English text → check “accent strength” control (Native + k·C).
  - Spanish text → see if you get “American-perceived Spanish” (Spanish + k·C).
You will likely observe:
- Small k: subtle accent shifts, some timbre/prosody changes.
- Large k: more drastic accent changes, but speaker identity may drift and artifacts may appear.
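The edit-and-renormalize step can be sketched as follows; `edited_speaker_embeddings` is a hypothetical helper, and each resulting vector would be supplied to XTTS-v2 as a pre-computed d-vector:

```python
import numpy as np

def edited_speaker_embeddings(S, C, ks=(-0.5, 0.0, 0.25, 0.5, 0.75)):
    """Build the grid of edited d-vectors S'(k) described above.

    S: base speaker embedding (e.g. A_avg), shape (d,)
    C: unit-norm accent direction, shape (d,)
    Returns {k: S'(k)}, each rescaled to the original norm ||S||.
    """
    norm_S = np.linalg.norm(S)
    out = {}
    for k in ks:
        s = S + k * C
        out[k] = s / np.linalg.norm(s) * norm_S   # restore original norm
    return out
```

Restoring ||S|| matters because speaker encoders tend to produce embeddings in a narrow norm band; an off-norm vector is an easy way to push the TTS conditioning out of distribution.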
6.5 Evaluation
Use multiple axes:
- Speaker similarity: cosine similarity of embeddings between original you and S’(k) speech (should stay high for reasonable k).
- Accent classifier output: P(native) vs k (should increase with k up to a point).
- ASR WER: use a strong ASR model, ideally trained on native English, on your synthetic audio; FAC papers use WER reduction as a proxy for “more canonical pronunciation”. (Unilight)
- Human listening tests (even informal with friends): “Which of these sounds more native, and which sounds most like the same person?”
This is exactly how recent FAC work evaluates their methods.
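The speaker-similarity axis reduces to a cosine over re-encoded embeddings. A minimal sketch, assuming you re-embed each synthesized clip with the same speaker encoder (`speaker_similarity` is a hypothetical name):

```python
import numpy as np

def speaker_similarity(ref_emb, test_embs):
    """Cosine similarity between a reference speaker embedding and the
    embeddings of utterances synthesized from each S'(k).

    ref_emb: (d,) embedding of your real voice.
    test_embs: {k: (d,) embedding of the re-encoded S'(k) audio}.
    Returns {k: similarity}; values should stay high for moderate |k|.
    """
    r = ref_emb / np.linalg.norm(ref_emb)
    return {k: float(np.dot(r, e / np.linalg.norm(e)))
            for k, e in test_embs.items()}
```

Plotting these values against the accent classifier's P(native) for the same grid of k gives the identity-vs-accent trade-off curve in one picture.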
7. Warnings and practical pitfalls
- Synthetic B is not ground truth
  - ElevenLabs’ “native” voice of you might bake in:
    - its own prosody,
    - its own noise model,
    - its own slight channel quirks.
  - C will encode ElevenLabs style as well as accent. You might partially correct for this by also building C from real native speakers, not only ElevenLabs.
- Accent is cross-lingual but not language-agnostic
  - C learned on English may not transfer cleanly to Spanish.
  - You might get “American-like rhythm/intonation” in Spanish, but local segmental changes will be messy.
- Large |k| leaves the manifold
  - As Lux et al. note, moving too far along a direction can produce unrealistic embeddings. (arXiv)
  - In practice you keep |k| small and consider adding a projection (PCA or autoencoder) to stay near real speakers.
- Accent vs intelligibility
  - Papers on accent normalization emphasize that “more native” in accent does not always mean better intelligibility or naturalness. (arXiv)
  - Listen critically: if hyper-native sounds uncanny, it might not be useful as a learning target.
8. Bottom line
- Conceptually, you are in the same space as:
  - Principal-direction editing of speaker embeddings; (arXiv)
  - Multi-scale accent modeling & DART; (arXiv)
  - Discrete-unit FAC and SpeechAccentLLM. (arXiv)
- People have not, as far as the literature shows, published exactly “personal accent delta = synthetic-native minus real-accent” as a reusable vector, but your proposal is a natural, lighter-weight variant of existing methods.
- With XTTS-v2 and its speaker encoder, plus a bit of embedding arithmetic and sanity-check classification, you can absolutely run the experiment you imagined and empirically test how linear and reusable accent vectors really are for your own voice.