Debugging Custom Stable Diffusion Pipeline for 1D Signal Generation

According to several reports there may be a bug in the newer Diffusers UNet1DModel; for now, here is the general answer I got from ChatGPT.


Below is a refined sampling pipeline built on top of Hugging Face Diffusers and adapted for 1D signal generation. It addresses common pitfalls around scheduling, guidance, and debugging, and draws on practices from models like DiffWave and ArchiSound.


from diffusers import UNet1DModel, DDIMScheduler
import torch

class SignalDiffusionPipeline:
    def __init__(self, vae, unet: UNet1DModel, scheduler: DDIMScheduler,
                 tokenizer, text_encoder):
        # The VAE is assumed to expose .device, .decode(), latent_channels,
        # latent_length and input_length (a custom 1D signal VAE, not a stock Diffusers one).
        self.vae = vae
        self.unet = unet.to(vae.device)
        self.scheduler = scheduler
        self.tokenizer = tokenizer
        self.text_encoder = text_encoder.to(vae.device)

    @torch.no_grad()
    def __call__(self, prompt: str, num_inference_steps=50,
                 guidance_scale=7.5, generator=None, debug=False):
        device = self.vae.device

        # 1️⃣ Text encoding & classifier‑free guidance
        # Pad to a fixed length so conditional and unconditional embeddings can be concatenated.
        tokens = self.tokenizer([prompt], padding="max_length", truncation=True,
                                max_length=self.tokenizer.model_max_length,
                                return_tensors="pt").to(device)
        text_emb = self.text_encoder(**tokens).last_hidden_state
        bs = text_emb.shape[0]  # number of prompts, before CFG duplication
        if guidance_scale > 1:
            null_tokens = self.tokenizer([""], padding="max_length", truncation=True,
                                         max_length=self.tokenizer.model_max_length,
                                         return_tensors="pt").to(device)
            null_emb = self.text_encoder(**null_tokens).last_hidden_state
            text_emb = torch.cat([null_emb, text_emb], dim=0)

        # 2️⃣ Correctly set diffusion timesteps
        self.scheduler.set_timesteps(num_inference_steps, device=device)
        timesteps = self.scheduler.timesteps

        # 3️⃣ Prepare noisy latent initialization (latent_channels / latent_length
        # are assumed attributes of the custom VAE)
        latent = torch.randn(
            (bs, self.vae.latent_channels, self.vae.latent_length),
            generator=generator, device=device
        ) * self.scheduler.init_noise_sigma

        # 4️⃣ Reverse diffusion denoising
        for i, t in enumerate(timesteps):
            latent_in = latent
            if guidance_scale > 1:
                latent_in = torch.cat([latent, latent], dim=0)
            latent_in = self.scheduler.scale_model_input(latent_in, t)

            # NOTE: the stock diffusers UNet1DModel.forward only accepts (sample, timestep);
            # passing encoder_hidden_states assumes a conditional 1D UNet variant.
            noise_pred = self.unet(latent_in, t, encoder_hidden_states=text_emb).sample
            if guidance_scale > 1:
                noise_uncond, noise_text = noise_pred.chunk(2)
                noise_pred = noise_uncond + guidance_scale * (noise_text - noise_uncond)

            latent = self.scheduler.step(noise_pred, t, latent).prev_sample

            # 5️⃣ Debug: Inspect intermediate signals
            if debug and i in {len(timesteps)//4, len(timesteps)//2, len(timesteps)-1}:
                mid = self.vae.decode(latent).reshape(bs, self.vae.input_length)
                print(f"Step {i}: mean={mid.mean():.4f}, std={mid.std():.4f}")

        # 6️⃣ Decode to final signal
        signal = self.vae.decode(latent).reshape(bs, self.vae.input_length)
        return signal.cpu().numpy()
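
A minimal usage sketch, for reference. The checkpoint paths and the MySignalVAE class are placeholders for whatever you actually trained; the hard requirements are a VAE exposing latent_channels / latent_length / input_length and a UNet that accepts encoder_hidden_states (see the note in the denoising loop).

from diffusers import UNet1DModel, DDIMScheduler
from transformers import CLIPTokenizer, CLIPTextModel

device = "cuda"

# Placeholder checkpoints; substitute your own trained components.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
unet = UNet1DModel.from_pretrained("path/to/your/unet").to(device)   # or your conditional 1D UNet variant
scheduler = DDIMScheduler.from_pretrained("path/to/your/run", subfolder="scheduler")
vae = MySignalVAE.from_pretrained("path/to/your/vae").to(device)     # hypothetical 1D signal VAE class

pipe = SignalDiffusionPipeline(vae, unet, scheduler, tokenizer, text_encoder)
signal = pipe("low-frequency sine with a transient burst",
              num_inference_steps=50, guidance_scale=7.5, debug=True)
print(signal.shape)   # (1, vae.input_length)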

✔️ Key improvements

  • Proper scheduler usage: .set_timesteps(...) aligns inference with the training schedule, eliminating timestep mismatches that often break denoising (see the scheduler-loading sketch after this list).
  • Guidance batching: duplicating the latents into a 2× batch (unconditional + conditional) applies classifier-free guidance the same way the Stable Diffusion pipelines do.
  • .prev_sample retrieval: scheduler.step(...) returns an output object, and reading .prev_sample matches the Diffusers scheduler API, avoiding subtle mismatches.
  • Intermediate debugging: printouts at 25%, 50%, and the final step help verify that a structured signal emerges from the noise.
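
On the first point, one way to guarantee the β schedule and timestep spacing match training is to rebuild the inference scheduler from the config saved with the training run. A short sketch, assuming a hypothetical save path and a DDPM training schedule sampled with DDIM:

from diffusers import DDPMScheduler, DDIMScheduler

# Load the scheduler config that was saved alongside the training run (path is a placeholder).
train_sched = DDPMScheduler.from_pretrained("path/to/your/run", subfolder="scheduler")

# Same betas, num_train_timesteps and prediction_type, but DDIM stepping for sampling.
scheduler = DDIMScheduler.from_config(train_sched.config)
scheduler.set_timesteps(50)
print(scheduler.timesteps[:5])  # descending timesteps, e.g. tensor([980, 960, 940, 920, 900])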

📚 Reference resources

  • Diffusers’ UNet1DModel docs — details block types and the expected forward signature (huggingface.co, discuss.huggingface.co)
  • Forum issue: users reported similar problems with trivial 1D data, caused by misunderstandings of timesteps or model behavior
  • DiffWave, ArchiSound — demonstrate use of stacked 1D UNets with dilated convs and spectral conditioning for audio diffusion (arxiv.org)

✅ What’s next?

  1. Run with debug=True. Do intermediate mean/std values drift toward training-signal stats?

  2. If not, examine (a sanity-check sketch follows this list):

    • VAE quality: latent distributions and reconstructions.
    • UNet training: standalone noise-prediction accuracy.
    • Scheduler consistency: ensure the β schedule and timesteps match the training configuration.
  3. If behavior remains noisy, consider integrating dilated convolutions or spectral-domain losses for better structure, following DiffWave and ArchiSound.
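
A small sanity-check sketch for item 2, assuming the VAE exposes encode()/decode() and the UNet accepts encoder_hidden_states as in the pipeline above; the tensor names are illustrative.

import torch
import torch.nn.functional as F

@torch.no_grad()
def sanity_checks(vae, unet, scheduler, batch, text_emb):
    """`batch` is a (B, C, L) tensor of training signals, `text_emb` the matching conditioning."""
    device = vae.device

    # (a) VAE quality: latent statistics and reconstruction error
    z = vae.encode(batch.to(device))                       # assumes encode() returns latents directly
    recon = vae.decode(z).reshape(batch.shape)
    print(f"latent mean={z.mean():.3f}, std={z.std():.3f}, "
          f"recon MSE={F.mse_loss(recon, batch.to(device)).item():.5f}")

    # (b) UNet training: standalone noise-prediction accuracy at random training timesteps
    noise = torch.randn_like(z)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (z.shape[0],), device=device)
    noisy = scheduler.add_noise(z, noise, t)
    pred = unet(noisy, t, encoder_hidden_states=text_emb).sample   # same conditional-UNet assumption as above
    print(f"noise-prediction MSE={F.mse_loss(pred, noise).item():.5f}")  # should sit well below the noise variance (1.0)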

Let me know how the intermediate signals evolve or if you’d like help measuring latent statistics or adjusting architecture!
