OOM error after creating pipeline

Hi everyone, I’ve got a strange error I’m hoping to get some insight on.

My training script for text-to-image generates a sample image every 500 steps. To do this I create a StableDiffusionPipeline via from_pretrained. The image is generated fine, and once it’s done my GPU memory usage drops back down to “idle” at around 13.5 GB. But when the next epoch begins, the memory usage jumps up to 23 GB and the script crashes at this step:

model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample

I delete the pipeline right after the image is generated so it’s not an issue of garbage collection, right?

del pipeline
gc.collect()
torch.cuda.empty_cache()
gc.collect()

If I disable both the image generation and the pipeline creation, everything works as expected: no OOM errors. If I disable the image generation but keep the pipeline creation, it crashes as described above. So I think the issue isn’t the image generation itself but the pipeline.
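To make the setup concrete, the per-500-step sampling hook boils down to roughly this (a simplified sketch with placeholder names; my actual pipeline-creation code is further down):

import gc
import torch
from diffusers import StableDiffusionPipeline

def generate_sample(pretrained_model_path, text_encoder, vae, unet, device, prompt):
    # Build a pipeline around the in-training components.
    pipeline = StableDiffusionPipeline.from_pretrained(
        pretrained_model_path,
        text_encoder=text_encoder,
        vae=vae,
        unet=unet,
        safety_checker=None,
    )
    pipeline.to(device)

    # Generate the sample image.
    with torch.autocast("cuda"), torch.inference_mode():
        image = pipeline(prompt).images[0]

    # Tear the pipeline down again; memory drops back to ~13.5 GB here,
    # then jumps to ~23 GB once the next epoch starts.
    del pipeline
    gc.collect()
    torch.cuda.empty_cache()
    return image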

I was reading some posts and added this to the pipeline deletion:

pipeline.to('cpu')
del pipeline

And upon starting the next epoch, this line fails with the following error:

latents = self.vae.encode(batch["pixel_values"].to(self.weight_dtype)).latent_dist.sample()

RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.HalfTensor) should be the same

Now why is the in-progress model also on the CPU?

You know, I was also getting this unusual error after generating images.

RuntimeError: Input type (c10::Half) and bias type (float) should be the same

What’s going on here? It looks like the pipeline is interfering with the model being trained. Passing a shallow copy of the unet to the unwrapper:

unet = accelerator.unwrap_model(copy.copy(unet))

This seems to solve the input type error. I’m sharing the accelerator and model objects between a parent and child class, so maybe something’s getting messed up? copy.deepcopy() didn’t work, only copy.copy().

But I did find the solution to my original problem. It appears that xformers gets disabled on the unet somewhere in here:

unet = self.accelerator.unwrap_model(copy.copy(self.unet))
if self.use_ema:
    self.ema_unet.copy_to(unet.parameters())
pipeline = StableDiffusionPipeline.from_pretrained(
    self.pretrained_model_path,
    text_encoder=self.text_encoder,
    vae=self.vae,
    unet=unet,
    revision=self.revision,
    safety_checker=None,
)
pipeline.to(self.accelerator.device)
pipeline.enable_xformers_memory_efficient_attention()
pipeline.enable_attention_slicing()

And re-enabling it with unet.enable_xformers_memory_efficient_attention() seems to solve the problem.
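In context it ends up looking like this (a sketch continuing from the snippet above):

# ...generate the sample image with the pipeline here...

# Tear the pipeline down as before.
del pipeline
gc.collect()
torch.cuda.empty_cache()

# Creating the pipeline seems to turn xformers off on the unet,
# so switch it back on before training resumes.
unet.enable_xformers_memory_efficient_attention()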

I don’t understand the inner workings of Python well enough to offer an explanation of this behavior. Anyone got any ideas? I’m really confused about what’s going on.

My parent/child class setup isn’t very complicated: the parent class handles the parts common to training textual inversion, DreamBooth, and text-to-image, and the child class is just the training-method-specific stuff and the training loop.

Hello,

If you are working on main, I believe that a consequence of this PR is that attention slicing and xformers attention are mutually exclusive. So I’d just call pipeline.enable_xformers_memory_efficient_attention and not enable attention slicing.
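Something like this, against your snippet above:

pipeline.to(self.accelerator.device)
pipeline.enable_xformers_memory_efficient_attention()
# pipeline.enable_attention_slicing()  # drop this call; the two are mutually exclusive on main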

Thanks for the reply!

I did a quick test with the train_text_to_image.py example script and added a simple function to generate images when saving. Only calling pipeline.enable_xformers_memory_efficient_attention() (and skipping attention slicing) doesn’t fix the input type problem on its own.

What fixes it for me is changing the unet unwrap to unet = accelerator.unwrap_model(unet, keep_fp32_wrapper=True) and generating the image inside an autocast context:

with torch.autocast('cuda'):
    image = pipeline('output.png').images[0]

With this code I could actually have both attention slicing and xformers enabled at the same time. But it was just a quick test so don’t take my word for it.
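For reference, the sampling function in that quick test ended up looking roughly like this (a sketch with placeholder names, not the exact script):

import torch
from diffusers import StableDiffusionPipeline

def sample_image(accelerator, pretrained_model_path, text_encoder, vae, unet, prompt):
    # Keep the fp32 wrapper so the unet's weight/bias dtypes still line up
    # with the half-precision inputs at inference time.
    unet = accelerator.unwrap_model(unet, keep_fp32_wrapper=True)

    pipeline = StableDiffusionPipeline.from_pretrained(
        pretrained_model_path,
        text_encoder=text_encoder,
        vae=vae,
        unet=unet,
        safety_checker=None,
    )
    pipeline.to(accelerator.device)
    pipeline.enable_xformers_memory_efficient_attention()
    pipeline.enable_attention_slicing()  # both worked together in this quick test

    # Generate under autocast, as in the snippet above.
    with torch.autocast("cuda"):
        return pipeline(prompt).images[0]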

In my project (not the example script) I need keep_fp32_wrapper=True (to fix the input type error) and ONLY pipeline.enable_xformers_memory_efficient_attention() (to fix OOM). I no longer need the copy.copy() on the unet unwrapper. I was already generating images using with torch.autocast("cuda"), torch.inference_mode():. I no longer need to re-enable xformers on the unet after inference.

My code is pretty similar to train_text_to_image.py (the only real difference is the inheritance setup), so I’m not sure why the two behave differently.