Replace Stable Diffusion text conditioning with rows of binary attributes

I want to use the Stable Diffusion model weights to generate class-conditional images. However, I don’t want these images to be conditioned on a text prompt, but rather on rows of binary class attributes.

To do this, I was thinking of using Diffusers, as it seems the most straightforward option. My idea is to replace the CLIP tokenizer and text encoder with a custom encoder that maps each attribute row into the embedding space the UNet is conditioned on. However, I can’t find any resources on this online, and I was wondering whether it is feasible within the Diffusers library.
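A minimal sketch of such an encoder, assuming PyTorch. The class name `AttributeEncoder`, the choice of one learned "off" and one learned "on" vector per attribute, and the example count of 40 attributes are all hypothetical. The width 768 is an assumption that matches the cross-attention dimension of Stable Diffusion 1.x checkpoints (2.x uses 1024); it can be read from `unet.config.cross_attention_dim`.

```python
import torch
from torch import nn


class AttributeEncoder(nn.Module):
    """Hypothetical stand-in for the CLIP text encoder.

    Turns a row of binary attributes into one conditioning token per attribute.
    """

    def __init__(self, num_attributes: int, cross_attention_dim: int = 768):
        super().__init__()
        # Two learned vectors per attribute: row 2*i is "off", row 2*i + 1 is "on".
        self.embed = nn.Embedding(2 * num_attributes, cross_attention_dim)
        self.norm = nn.LayerNorm(cross_attention_dim)
        self.register_buffer(
            "offsets", 2 * torch.arange(num_attributes), persistent=False
        )

    def forward(self, attributes: torch.Tensor) -> torch.Tensor:
        # attributes: (batch, num_attributes) with values in {0, 1}
        idx = attributes.long() + self.offsets
        # result: (batch, num_attributes, cross_attention_dim)
        return self.norm(self.embed(idx))


encoder = AttributeEncoder(num_attributes=40)
attrs = torch.randint(0, 2, (4, 40))
tokens = encoder(attrs)  # shape (4, 40, 768)
```

The output has the same layout as the text encoder's hidden states, so it could be passed to the UNet as `encoder_hidden_states`. The sequence length does not have to be 77, since cross-attention accepts any number of conditioning tokens.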

I understand that StableDiffusionPipeline is likely too rigid for this. How would I define a model that uses these attribute rows as the conditioning input for generation, and how could such a model be trained or fine-tuned?
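One possible shape for the training side, bypassing the pipeline and calling the components directly. This is a sketch under assumptions, not a confirmed recipe: `training_step` and `attribute_encoder` are hypothetical names, `attribute_encoder` is any module returning a tensor of shape `(batch, tokens, cross_attention_dim)`, and the loss assumes a checkpoint trained to predict the added noise (epsilon prediction, as in Stable Diffusion 1.x). The `unet` and `noise_scheduler` arguments are expected to behave like Diffusers' `UNet2DConditionModel` and `DDPMScheduler`.

```python
import torch
import torch.nn.functional as F


def training_step(unet, noise_scheduler, attribute_encoder, latents, attributes):
    """Compute one denoising loss with attribute rows as conditioning.

    latents: VAE-encoded images, already multiplied by the VAE scaling factor.
    attributes: (batch, num_attributes) binary rows.
    """
    batch_size = latents.shape[0]
    noise = torch.randn_like(latents)

    # Sample one random diffusion timestep per example.
    timesteps = torch.randint(
        0,
        noise_scheduler.config.num_train_timesteps,
        (batch_size,),
        device=latents.device,
    )
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    # The attribute tokens take the place of the CLIP text embeddings.
    cond = attribute_encoder(attributes)
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=cond).sample

    # Assumes epsilon prediction; a v-prediction checkpoint needs a different target.
    return F.mse_loss(noise_pred, noise)
```

With Diffusers, the pretrained pieces could be loaded with `UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")` and `DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")`. The optimizer would then cover the new encoder's parameters, and optionally the UNet's if it is fine-tuned as well rather than kept frozen.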