ControlNet is effectively unusable with models whose architectures differ even slightly from its base. While the SD-1.5 TemporalNet2 appears difficult to use with SD-Turbo, an equivalent ControlNet for standard SD 2.1 does exist, e.g. https://huggingface.co/daydreamlive/TemporalNet2-stable-diffusion-2-1
You are correct that you should not expect your current SD-1.5 TemporalNet2 to be “compatible” with SD-Turbo in any reliable way, and it is incorrect to think that --upcast_attention would make it compatible. That flag only changes how attention is numerically computed (precision), not which base model a ControlNet is trained for.
Below is the reasoning, with canonical references.
1. What you have working now (SD-1.5 + TemporalNet2)
You currently have an SD-1.5 base model paired with the wav/TemporalNet2 ControlNet. TemporalNet2 itself is described as:
“TemporalNet was a ControlNet model designed to enhance the temporal consistency of generated outputs”
and further discussions and docs explain that TemporalNet2 extends this idea by conditioning on the previous frame plus an optical-flow encoding, for a total of 6 conditioning channels.
You converted wav/TemporalNet2 to diffusion_pytorch_model.bin, and your pipeline now correctly accepts these 6 conditioning channels and produces images.
So your working configuration is:
SD-1.5 base + TemporalNet2 (SD-1.5-trained ControlNet, 6-channel conditioning).
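For concreteness, here is a minimal sketch of how such a converted checkpoint is typically loaded in Diffusers. The local directory path is hypothetical, and the `conditioning_channels` line is just a sanity check on the 6-channel conditioning head:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Hypothetical local directory holding the converted config.json +
# diffusion_pytorch_model.bin from wav/TemporalNet2:
controlnet = ControlNetModel.from_pretrained(
    "./temporalnet2-sd15", torch_dtype=torch.float16
)
# TemporalNet2's conditioning head ingests 6 channels
# (previous frame + optical-flow encoding):
print(controlnet.config.conditioning_channels)  # expected: 6

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # the SD-1.5 base it was trained for
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")
```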
2. What SD-Turbo actually is (canonical reference)
The canonical reference for SD-Turbo is the Hugging Face model card stabilityai/sd-turbo. It states:
- “SD-Turbo is a distilled version of Stable Diffusion 2.1, trained for real-time synthesis.”
- The distillation uses Adversarial Diffusion Distillation (ADD), which allows sampling in 1–4 steps while retaining high image quality.
So:
- Architecturally, SD-Turbo is still a UNet2DConditionModel with the same high-level structure as Stable Diffusion 2.1.
- But its weights and denoising behavior are those of a distilled SD-2.1, not SD-1.5.
Community discussions also summarize this simply as:
“That is based on SD 2.1, not 1.5.”
So SD-Turbo is SD-2.1-family, not SD-1.5-family.
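The model card's own usage pattern reflects the distillation: 1-step sampling with guidance disabled. A minimal sketch, following the card's example:

```python
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sd-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

# ADD-distilled models sample in 1-4 steps and are run without CFG:
image = pipe(
    "a cinematic shot of a baby racoon wearing an intricate italian priest robe",
    num_inference_steps=1,
    guidance_scale=0.0,
).images[0]
```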
3. How ControlNet (and TemporalNet2) depends on the base model
The key design of ControlNet is: it copies the UNet blocks of the base model and adds a trainable branch that learns to inject conditioning information:
- The official ControlNet README (canonical reference: lllyasviel/ControlNet) describes it as:
  “ControlNet is a neural network structure to control diffusion models by adding extra conditions… It copies the weights of neural network blocks into a ‘locked’ copy and a ‘trainable’ copy. The ‘trainable’ one learns your condition. The ‘locked’ one preserves your model.”
- The model card explicitly lists several “ControlNet+SD1.5” models (scribble, segmentation, etc.), emphasizing that each ControlNet is tied to a specific base family:
  “The ControlNet+SD1.5 model to control SD using human scribbles…”
Independent descriptions of ControlNet’s architecture say the same thing: extra inputs are fed into a trainable copy of the original UNet encoder and then fused with the base model in intermediate layers.
Consequences:
- Shape coupling
  - The ControlNet UNet must be architecturally compatible with the base model’s UNet: same number of channels, blocks, attention layout, etc.
  - That is why ControlNets are distributed separately for SD-1.5 vs SD-2.x vs SDXL.
- Training-distribution coupling
  - Even if shapes line up, ControlNet’s weights are trained on the feature distribution of a particular base model (its noise schedule, text encoder, dataset, etc.).
  - Moving the same ControlNet to another base (different text encoder, different noise schedule, distilled behavior) gives no guarantee of correct behavior.
Early ControlNet documentation and A1111 guides were explicit that the first-generation models were for SD-1.5 only and were not meant to be used with SD-2.x.
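You can see the shape coupling directly in the configs. A small sketch comparing the cross-attention width of the two UNet families (768 for SD-1.5's CLIP ViT-L/14 vs. 1024 for SD-2.1's OpenCLIP-ViT/H):

```python
from diffusers import UNet2DConditionModel

unet_15 = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
unet_21 = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-2-1", subfolder="unet"
)

# A ControlNet must match its base model on fields like this one,
# and they differ across families:
print(unet_15.config.cross_attention_dim)  # 768  (CLIP ViT-L/14)
print(unet_21.config.cross_attention_dim)  # 1024 (OpenCLIP-ViT/H)
```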
TemporalNet2 is “just” a 6-channel special-purpose ControlNet:
- It uses the same ControlNet idea (copy of UNet blocks).
- But its conditioning is previous frame + optical flow.
- And it was trained in the SD-1.5 ecosystem, as the TemporalNet/TemporalNet2 repos and discussions make clear.
So your wav/TemporalNet2 is, in effect:
“A ControlNet-style UNet side branch that expects SD-1.5’s latent features and noise schedule.”
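As an illustration of that 6-channel conditioning, here is a hedged sketch of how the input is typically assembled; the exact flow encoding (e.g. an RGB flow visualization) depends on the TemporalNet2 preprocessing you use, so treat the layout below as an assumption:

```python
import numpy as np
import torch

def make_temporalnet2_conditioning(
    prev_frame_rgb: np.ndarray,  # HxWx3 uint8, previous output frame
    flow_rgb: np.ndarray,        # HxWx3 uint8, optical flow rendered as RGB
) -> torch.Tensor:
    """Stack previous frame + flow into a 1x6xHxW conditioning tensor."""
    cond = np.concatenate([prev_frame_rgb, flow_rgb], axis=2)  # HxWx6
    cond = torch.from_numpy(cond).float() / 255.0              # [0, 1] range
    return cond.permute(2, 0, 1).unsqueeze(0)                  # 1x6xHxW
```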
4. What --upcast_attention actually does (canonical script)
The --upcast_attention option you quoted comes from the official Diffusers conversion script convert_original_stable_diffusion_to_diffusers.py. The relevant part is:

```python
parser.add_argument(
    "--upcast_attention",
    action="store_true",
    help=(
        "Whether the attention computation should always be upcasted. "
        "This is necessary when running stable diffusion 2.1."
    ),
)
```
The canonical meaning of this flag is:
- In the UNet, attention is often computed in float16 for speed and memory.
- For Stable Diffusion 2.1, using FP16 attention can cause numerical issues (overflows/instability) because of:
  - the higher resolution,
  - the different text encoder (OpenCLIP-ViT/H) and its activation ranges,
  as described in the SD-2.0/2.1 release notes and README.
- --upcast_attention just tells Diffusers to compute attention in float32 (FP32) even if the rest of the model is half-precision, to avoid those stability issues.
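Conceptually (this is a toy illustration, not Diffusers' actual attention code), upcasting means doing the numerically sensitive score computation in FP32 and casting back:

```python
import torch

def toy_attention(q, k, v, upcast: bool = False):
    if upcast:
        # Compute attention scores in FP32 even if q/k/v arrived as float16:
        q, k = q.float(), k.float()
    scores = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
    # Cast back so the rest of the model keeps running in half precision:
    return scores.to(v.dtype) @ v
```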
Important:
- It does not change the architecture.
- It does not change which base model a ControlNet was trained for.
- It does not adapt SD-1.5-trained weights to SD-2.1/SD-Turbo.
It is purely a numerical-stability flag for SD-2.1-family models, not a compatibility switch.
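In Diffusers the flag simply ends up as a boolean in the UNet config; you can verify it is already set for SD-2.1:

```python
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-2-1", subfolder="unet"
)
# A precision hint only; nothing architectural:
print(unet.config.upcast_attention)  # True
```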
5. Why an SD-1.5 TemporalNet2 is not “made compatible” with SD-Turbo
Putting this together:
5.1 Base families are different
- SD-1.5: uses CLIP ViT-L/14 as the text encoder; trained on LAION-5B v1-style data at 512×512; the original 1.x line.
- SD-2.1: same overall UNet parameter count as 1.5, but:
  - uses OpenCLIP-ViT/H as the text encoder,
  - was trained with a different dataset and filtering regime,
  - introduces v-prediction variants and higher resolutions.
- SD-Turbo: explicitly described as
  “a distilled version of Stable Diffusion 2.1 … trained for real-time synthesis using Adversarial Diffusion Distillation (ADD).”
So from TemporalNet2’s point of view, SD-Turbo produces SD-2.1-family latent features and text-encoder embeddings, and the ControlNet branch (TemporalNet2) you have was never trained on those features.
5.2 What can actually happen if you force it
If you somehow wire an SD-1.5 TemporalNet2 into an SD-Turbo pipeline:
- Best case:
  - The UNet shapes happen to align well enough that you avoid hard shape errors.
  - The model runs, but the temporal control is weak or unstable, often ignored or producing artifacts.
- Worst case:
  - Weight shapes don’t match in some blocks.
  - You get immediate tensor size errors (“mat1 and mat2 shapes cannot be multiplied”, etc.), as seen when mixing mismatched ControlNets and base models in Diffusers.
Either way, --upcast_attention does not fix any of this; it only changes FP16→FP32 math inside attention layers.
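To make the worst case concrete, here is a hedged sketch using a generic SD-1.5 ControlNet (the same family mismatch applies to TemporalNet2). Running it typically fails inside the ControlNet's cross-attention, because SD-Turbo's text encoder produces 1024-dimensional embeddings while the SD-1.5 ControlNet expects 768:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# SD-1.5-family ControlNet (same family mismatch as TemporalNet2):
controlnet_sd15 = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stabilityai/sd-turbo", controlnet=controlnet_sd15, torch_dtype=torch.float16
)
# Calling pipe(...) now typically fails inside the ControlNet's
# cross-attention with a shape error along the lines of
# "mat1 and mat2 shapes cannot be multiplied" (768- vs 1024-dim projections).
```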
5.3 Community guidance mirrors this
- ControlNet’s official README and model cards emphasize base-specific models (“ControlNet+SD1.5 model”, SD-2.1 branches, SDXL branches).
- A1111 ControlNet discussions explicitly stated early on that the extension and its models were for SD-1.5 only, not SD-2.x, until separate models appeared.
So your underlying intuition:
“If I go from SD-1.5 to SD-Turbo, I can no longer meaningfully use the SD-1.5 TemporalNet2.”
is essentially correct. You should treat SD-1.5 TemporalNet2 and SD-Turbo as belonging to different ecosystems.
And the conjecture:
“Maybe --upcast_attention makes the SD-1.5 TemporalNet2 compatible with SD-Turbo.”
is incorrect. That flag changes attention precision only; it does not retarget, re-train, or re-architect the ControlNet.
6. If you want SD-Turbo and TemporalNet-style behavior
This part goes beyond your exact question, but it is the natural “what next?”:
- SD-Turbo is SD-2.1-family, so the correct temporal ControlNet to combine with it would be a TemporalNet2 trained for SD-2.1, not the SD-1.5 version.
- There are community TemporalNet2 checkpoints published specifically for Stable Diffusion 2.1 (e.g. model cards labeled “TemporalNet2 stable-diffusion-2-1”), which are architecturally aligned to SD-2.1 and a far better match for SD-Turbo than your SD-1.5 TemporalNet2. (As noted at the top, these are separate repos from the original SD-1.5 TemporalNet2.)
The pattern that other projects use is the same as with SD-2.1 ControlNets like canny/depth:
- Base: stabilityai/sd-turbo (distilled SD-2.1).
- ControlNet: any SD-2.1-trained ControlNet (canny, depth, TemporalNet2-for-2.1, etc.).
That is still a bit experimental (because SD-Turbo is distilled), but it is fundamentally more sound than trying to bolt an SD-1.5 ControlNet onto SD-Turbo.
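Put together, a hedged sketch of that pattern. The ControlNet repo id here is an assumed community SD-2.1 checkpoint and stands in for whichever SD-2.1-trained ControlNet (canny, depth, or a TemporalNet2-for-2.1) you actually use:

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Assumed community SD-2.1 ControlNet; substitute the repo you actually use:
controlnet_sd21 = ControlNetModel.from_pretrained(
    "thibaud/controlnet-sd21-canny-diffusers", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stabilityai/sd-turbo", controlnet=controlnet_sd21, torch_dtype=torch.float16
).to("cuda")

control_image = Image.new("RGB", (512, 512))  # placeholder; use a real hint image

# Turbo-style sampling: few steps, no CFG. Because the base is distilled,
# expect to tune controlnet_conditioning_scale experimentally.
image = pipe(
    "a photo of a room",
    image=control_image,
    num_inference_steps=2,
    guidance_scale=0.0,
).images[0]
```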
7. Direct answers to your questions
“Moving forward, I am thinking of going from SD-1.5 to SD-Turbo … am I right to assume that I can no longer use the TemporalNet because it will not be compatible?”
For your current TemporalNet2 (the SD-1.5 one you converted via wav/TemporalNet2):
- Yes, treat it as not compatible in any reliable way with SD-Turbo.
- Even if you can hack something that runs, it won’t be architecturally or training-wise correct for SD-2.1/SD-Turbo.
“Is it an incorrect conjecture to say that this switch (--upcast_attention) would make the controlnet, which was made for SD-1.5, compatible with SD-Turbo?”
Yes, that conjecture is incorrect: --upcast_attention changes attention precision only. To use TemporalNet-style control with SD-Turbo, you should look at TemporalNet2 checkpoints trained for SD-2.1, not reuse the SD-1.5 checkpoint with a precision flag.
Bullet summary
- Your current TemporalNet2 (wav/TemporalNet2) is a ControlNet-style model trained for SD-1.5, using 6-channel conditioning (previous frame + optical flow).
- SD-Turbo (stabilityai/sd-turbo) is a distilled Stable Diffusion 2.1 model, not SD-1.5.
- ControlNets are base-specific: they copy the base UNet and are trained on that base model’s feature distribution (text encoder, noise schedule, dataset).
- The --upcast_attention flag in the conversion script only upcasts attention math to FP32 for SD-2.1; it does not adapt SD-1.5 ControlNet weights to SD-2.1 or SD-Turbo.
- Therefore:
  - You should not expect your SD-1.5 TemporalNet2 to be compatible with SD-Turbo.
  - --upcast_attention does not solve this; it is not a compatibility switch.