Debugging Custom Stable Diffusion Pipeline for 1D Signal Generation

Hi everyone,

I’m building a custom Stable Diffusion pipeline to generate 1D signal data instead of images. Here’s what I’ve implemented so far:

  • ✅ Trained a custom autoencoder to encode raw 1D signals into a latent space and decode them back. The latent dimension is 4.
  • ✅ Trained a 1D UNet that takes the latent signal, text embedding, and timestep embedding as input and predicts noise (a simplified training step is sketched below).
  • ✅ Built a custom pipeline using DDIMScheduler to iteratively denoise and decode signals from latent space.
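
For reference, the UNet training step looks roughly like this (simplified; the autoencoder is frozen and `text_emb` comes from the text encoder):

import torch
import torch.nn.functional as F

def training_step(unet, scheduler, latents, text_emb):
    # latents: (B, 4, latent_length) produced by the frozen autoencoder
    # text_emb: text-encoder output for the batch's prompts
    noise = torch.randn_like(latents)
    t = torch.randint(
        0, scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=latents.device,
    )
    noisy_latents = scheduler.add_noise(latents, noise, t)
    noise_pred = unet(sample=noisy_latents, timestep=t, text_embedding=text_emb).sample
    return F.mse_loss(noise_pred, noise)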

However, the final generated signal is just noise and doesn’t resemble the training data. I’m not sure whether:

  • My latent space isn’t properly learned,
  • The UNet training is unstable,
  • Or my sampling loop has a mistake.

Has anyone encountered a similar issue when adapting Stable Diffusion to 1D data?
Any help, debugging tips, or pointers would be greatly appreciated!

Thanks in advance 🙏

Here’s my sampling pipeline (__call__):

# Method excerpt; assumes the usual diffusers helpers are imported:
#   from typing import List
#   from diffusers.utils.torch_utils import randn_tensor
#   from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion import retrieve_timesteps

@torch.no_grad()
def __call__(
    self,
    prompt,
    signal_length=None,
    latent_length=None,
    num_inference_steps=50,
    guidance_scale=7.5,
    generator=None,
    latents=None,
    return_dict=True,
    output_type="np",
    sigmas: List[float] = None,
    timesteps: List[int] = None,
):
    device = self._execution_device
    batch_size = 1 if isinstance(prompt, str) else len(prompt)
    self._interrupt = False

    # 1. Determine latent/signal length
    if signal_length is None:
        latent_length = getattr(self.vae.config, "latent_channels", 64)
        signal_length = getattr(self.vae.config, "input_length", 500)

    # 2. Encode prompt
    prompt_embeds, negative_prompt_embeds = self.encode_prompt(
        prompt, device, guidance_scale=guidance_scale
    )
    if guidance_scale > 1.0:
        prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])

    # 3. Prepare timesteps
    timesteps, num_inference_steps = retrieve_timesteps(
        self.scheduler, num_inference_steps, device, timesteps, sigmas
    )

    # 4. Prepare latents
    if latents is None:
        latents = randn_tensor(
            (batch_size, self.unet.config["in_channels"], latent_length),
            generator=generator,
            device=device,
            dtype=prompt_embeds.dtype,
        )
    else:
        latents = latents.to(device)
    latents *= self.scheduler.init_noise_sigma

    # 5. Denoising loop
    num_warmup_steps = len(timesteps) - num_inference_steps * self.scheduler.order

    with self.progress_bar(total=num_inference_steps) as progress_bar:
        for i, t in enumerate(timesteps):
            if self.interrupt:
                continue

            latent_model_input = torch.cat([latents] * 2) if guidance_scale > 1.0 else latents
            latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)
            t_batch = torch.full((batch_size,), t, dtype=torch.long, device=t.device)

            noise_pred = self.unet(
                sample=latent_model_input,
                timestep=t_batch,
                text_embedding=prompt_embeds,
            ).sample

            if guidance_scale > 1.0:
                noise_uncond, noise_text = noise_pred.chunk(2)
                noise_pred = noise_uncond + guidance_scale * (noise_text - noise_uncond)

            latents = self.scheduler.step(noise_pred, t, latents, return_dict=False)[0]

            if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
                progress_bar.update()

    # 6. Decode
    self.vae.eval()
    signal = self.vae.decode(latents)[0]
    if output_type == "np":
        signal = signal.squeeze(1).detach().cpu().numpy()

    return {"signals": signal} if return_dict else signal

There are several reports of a possible bug in the newer Diffusers UNet1DModel, but for now, here is the general answer I got from ChatGPT.


Below is a refined sampling pipeline built on top of Hugging Face Diffusers and adapted for 1D signal generation. It addresses common pitfalls around scheduling, guidance, and debugging, and draws on practices from models like DiffWave and ArchiSound.


import torch
from diffusers import DDIMScheduler

class SignalDiffusionPipeline:
    def __init__(self, vae, unet, scheduler: DDIMScheduler,
                 tokenizer, text_encoder, device="cuda"):
        self.device = torch.device(device)
        self.vae = vae.to(self.device)
        self.unet = unet.to(self.device)
        self.scheduler = scheduler
        self.tokenizer = tokenizer
        self.text_encoder = text_encoder.to(self.device)

    def _encode(self, texts):
        # Pad to a fixed length so the conditional and unconditional
        # embeddings can be concatenated for classifier-free guidance.
        tokens = self.tokenizer(
            texts, padding="max_length",
            max_length=self.tokenizer.model_max_length,
            truncation=True, return_tensors="pt",
        ).to(self.device)
        return self.text_encoder(**tokens).last_hidden_state

    @torch.no_grad()
    def __call__(self, prompt: str, num_inference_steps=50,
                 guidance_scale=7.5, generator=None, debug=False):
        # 1️⃣ Text encoding & classifier-free guidance
        text_emb = self._encode([prompt])
        bs = text_emb.shape[0]  # batch size before CFG duplication
        if guidance_scale > 1:
            null_emb = self._encode([""])
            text_emb = torch.cat([null_emb, text_emb], dim=0)

        # 2️⃣ Correctly set diffusion timesteps
        self.scheduler.set_timesteps(num_inference_steps)
        timesteps = self.scheduler.timesteps

        # 3️⃣ Prepare noisy latent initialization
        latent = torch.randn(
            (bs, self.vae.latent_channels, self.vae.latent_length),
            generator=generator, device=self.device,
        ) * self.scheduler.init_noise_sigma

        # 4️⃣ Reverse diffusion denoising
        for i, t in enumerate(timesteps):
            latent_in = torch.cat([latent, latent], dim=0) if guidance_scale > 1 else latent
            latent_in = self.scheduler.scale_model_input(latent_in, t)

            # Assumes a text-conditioned 1D UNet; the stock UNet1DModel is
            # unconditional and does not accept encoder_hidden_states.
            noise_pred = self.unet(latent_in, t, encoder_hidden_states=text_emb).sample
            if guidance_scale > 1:
                noise_uncond, noise_text = noise_pred.chunk(2)
                noise_pred = noise_uncond + guidance_scale * (noise_text - noise_uncond)

            latent = self.scheduler.step(noise_pred, t, latent).prev_sample

            # 5️⃣ Debug: inspect intermediate signals
            if debug and i in {len(timesteps) // 4, len(timesteps) // 2, len(timesteps) - 1}:
                mid = self.vae.decode(latent).reshape(bs, self.vae.input_length)
                print(f"Step {i}: mean={mid.mean():.4f}, std={mid.std():.4f}")

        # 6️⃣ Decode to final signal
        signal = self.vae.decode(latent).reshape(bs, self.vae.input_length)
        return signal.cpu().numpy()
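
Hypothetical usage (a sketch: `my_vae` and `my_unet` stand in for your trained modules, and any CLIP-style tokenizer/text-encoder pair will do):

from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import DDIMScheduler

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
scheduler = DDIMScheduler(num_train_timesteps=1000)  # must match the training schedule

pipe = SignalDiffusionPipeline(my_vae, my_unet, scheduler, tokenizer, text_encoder)
signal = pipe("a low-frequency sine burst", num_inference_steps=50, debug=True)
print(signal.shape)  # (1, input_length)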

✔️ Key improvements

  • Proper scheduler usage: .set_timesteps(...) aligns inference with training, eliminating timestep mismatches that often break denoising (a consistency check is sketched after this list).
  • Guidance batching: the 2×-batch trick ensures classifier-free guidance is applied correctly, with matching latent and embedding batch sizes.
  • .prev_sample retrieval: matches the scheduler API's expectations, avoiding subtle algorithmic mismatches.
  • Intermediate debugging: printouts at 25%, 50%, and the final step help verify that signal emerges from the noise.
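
For the scheduler point in particular, a minimal consistency check (the save path here is hypothetical; point it at wherever you stored the training scheduler's config):

from diffusers import DDIMScheduler, DDPMScheduler

# Load the scheduler config saved at training time (hypothetical path).
train_sched = DDPMScheduler.from_pretrained("./my_signal_model/scheduler")

# Build the inference scheduler from the same config instead of defaults.
infer_sched = DDIMScheduler.from_config(train_sched.config)

assert infer_sched.config.num_train_timesteps == train_sched.config.num_train_timesteps
assert infer_sched.config.beta_schedule == train_sched.config.beta_schedule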

📚 Reference resources

  • Diffusers’ UNet1DModel docs — details the available block types and the expected forward signature (huggingface.co)
  • Forum threads where users hit similar issues on trivial 1D data due to misunderstood timesteps or model behavior (discuss.huggingface.co)
  • DiffWave, ArchiSound — demonstrate stacked 1D UNets with dilated convolutions and spectral conditioning for audio diffusion (arxiv.org)

✅ What’s next?

  1. Run with debug=True. Do the intermediate mean/std values drift toward your training-signal statistics?

  2. If not, examine (see the diagnostic sketch after this list):

    • VAE quality: latent distributions and reconstructions.
    • UNet training: standalone noise-prediction accuracy.
    • Scheduler consistency: ensure the β schedule and timesteps match the training configuration.

  3. If the output remains noisy, consider adding dilated convolutions or spectral-domain losses for better structure, following DiffWave and ArchiSound (a minimal dilated block is also sketched below).
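
A minimal diagnostic sketch for points 1–2, assuming `vae`, `unet`, and `scheduler` are your trained modules, `signals` is a batch of real training signals shaped (B, 1, input_length), and `vae.encode` returns the latent tensor directly:

import torch
import torch.nn.functional as F

@torch.no_grad()
def run_diagnostics(vae, unet, scheduler, signals, text_emb):
    # (a) VAE quality: reconstruction error and latent statistics.
    latents = vae.encode(signals)
    recon = vae.decode(latents)
    print(f"recon MSE: {F.mse_loss(recon, signals).item():.6f}")
    print(f"latent mean={latents.mean():.3f}, std={latents.std():.3f}  (want roughly 0 / 1)")

    # (b) UNet quality: standalone noise prediction at a mid-range timestep.
    noise = torch.randn_like(latents)
    t = torch.full((latents.shape[0],), scheduler.config.num_train_timesteps // 2,
                   dtype=torch.long, device=latents.device)
    noisy = scheduler.add_noise(latents, noise, t)
    noise_pred = unet(sample=noisy, timestep=t, text_embedding=text_emb).sample
    # Predicting all zeros would give MSE ≈ 1.0, so a trained UNet should be well below that.
    print(f"noise-pred MSE: {F.mse_loss(noise_pred, noise).item():.6f}")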
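
And for point 3, a minimal DiffWave-style dilated residual block (a sketch of the idea, not the paper's exact architecture):

import torch
import torch.nn as nn

class DilatedResBlock1D(nn.Module):
    """Gated residual 1D conv block with dilation, in the spirit of WaveNet/DiffWave."""

    def __init__(self, channels: int, dilation: int):
        super().__init__()
        # "same" padding so the signal length is preserved.
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.out = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, filt = self.conv(x).chunk(2, dim=1)
        h = torch.sigmoid(gate) * torch.tanh(filt)  # gated activation unit
        return x + self.out(h)                      # residual connection

# Stack with dilations 1, 2, 4, 8 to widen the receptive field cheaply.
blocks = nn.Sequential(*[DilatedResBlock1D(64, 2 ** i) for i in range(4)])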

Let me know how the intermediate signals evolve or if you’d like help measuring latent statistics or adjusting architecture!
