Training Image Generation Models with Diffusion (Stable Diffusion XL + LoRA)
Introduction
Image generation using diffusion models has become one of the most transformative capabilities of modern artificial intelligence. These models are capable of generating high‑quality images from natural language descriptions, enabling applications across multiple industries.
Real-world use cases include:
- Marketing and advertising -- generating visual assets automatically.
- Software documentation and presentations -- producing diagrams and illustrations for technical content.
- Game development -- generating textures, characters, and environments.
- Product design -- visualizing concepts before prototypes exist.
- Enterprise automation -- generating architecture diagrams or slide illustrations automatically.
In enterprise environments, diffusion models can be integrated into automated pipelines. For example, a presentation generator can automatically produce slide images that represent architectural concepts such as:
- API Gateways
- Databases
- Integration architectures
- Cloud infrastructures
Instead of manually creating diagrams, an AI pipeline can generate them dynamically from structured prompts.
This tutorial explains how to train and use a Stable Diffusion XL model with LoRA fine‑tuning, and how to deploy the model to generate images programmatically.
Technologies Involved
Diffusion Models
Diffusion models generate images by iteratively denoising random noise. The training process teaches a neural network how to reverse a noise process applied to real images.
The process works as follows:
- Start with a real image
- Gradually add noise until the image becomes pure noise
- Train a model to reverse this process
- During inference, start with noise and iteratively remove it
This allows the model to synthesize new images from text descriptions.
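The forward (noising) half of the steps above can be sketched in a few lines. This is a minimal illustration assuming a linear beta schedule; real schedulers (such as Diffusers' DDPMScheduler) expose more options but follow the same math:

```python
import torch

def add_noise(x0, noise, t, num_steps=1000):
    """Forward diffusion: blend a clean sample with noise at timestep t."""
    betas = torch.linspace(1e-4, 0.02, num_steps)          # linear schedule (assumption)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)     # cumulative signal retention
    a = alphas_cumprod[t]
    return a.sqrt() * x0 + (1 - a).sqrt() * noise

x0 = torch.zeros(1, 3, 8, 8)            # stand-in for a "clean" image
noise = torch.randn_like(x0)
early = add_noise(x0, noise, t=10)      # mostly signal, little noise
late = add_noise(x0, noise, t=999)      # almost pure noise
```

At small `t` the output stays close to the original image; at the final timestep it is nearly indistinguishable from raw noise, which is exactly the state inference starts from.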
Popular diffusion models include:
- Stable Diffusion
- Stable Diffusion XL (SDXL)
- DALL‑E
- Imagen
In this tutorial we use Stable Diffusion XL, which provides:
- Higher resolution
- Better text understanding
- Dual text encoders
- Micro‑conditioning
Stable Diffusion XL (SDXL)
SDXL is an advanced diffusion architecture that improves generation quality through:
- Two text encoders
- Improved conditioning
- Higher resolution generation
- Better prompt interpretation
Unlike earlier diffusion models, SDXL requires:
- two tokenizers
- two text encoders
- pooled embeddings
- time conditioning parameters
LoRA (Low-Rank Adaptation)
Training diffusion models from scratch is extremely expensive.
LoRA instead fine-tunes large models efficiently by training small low-rank matrices that modify the attention layers of the network, while the original weights stay frozen.
Advantages:
- Very small training footprint
- Works with limited VRAM
- Easy to merge into the base model
- Fast training
In this project, LoRA is applied to the UNet attention layers.
HuggingFace Diffusers
The Diffusers library provides a high‑level API for working with diffusion models.
It includes:
- pipelines
- schedulers
- training utilities
- optimization helpers
Main components used:
- StableDiffusionXLPipeline
- DDPMScheduler
- DPMSolverMultistepScheduler
PyTorch
PyTorch is used for:
- training loops
- GPU acceleration
- tensor operations
- neural network execution
Code Walkthrough
Dataset Structure
The training script expects a dataset structured as:
dataset/
images/
image1.png
image2.png
captions/
image1.txt
image2.txt
Each caption describes the image.
Example caption:
enterprise cloud architecture diagram with API gateway and database
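A hypothetical helper shows how this layout pairs each image with its caption file; the directory names (`dataset/`, `images/`, `captions/`) follow the structure above, and the function itself is an illustrative sketch rather than code from the training script:

```python
from pathlib import Path

def load_pairs(root):
    """Pair every image in images/ with the same-named .txt in captions/."""
    root = Path(root)
    pairs = []
    for img in sorted((root / "images").glob("*.png")):
        caption_file = root / "captions" / (img.stem + ".txt")
        pairs.append((img, caption_file.read_text().strip()))
    return pairs
```

Keeping captions in sibling files with matching stems makes the pairing logic trivial and lets you edit captions without touching the images.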
Dataset Loader
The dataset loader reads images and their captions.
Key operations:
- resizing images
- converting to tensors
- normalization
Important section:
transforms.Compose([
    transforms.Resize((IMAGE_SIZE, IMAGE_SIZE)),
    transforms.ToTensor(),
    transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
])
Normalization is important because diffusion models expect images in a [-1,1] range.
Prompt Encoding
SDXL uses two text encoders.
The function:
encode_prompt_sdxl()
performs:
- tokenization of captions
- embedding generation
- concatenation of embeddings
Important concept:
prompt_embeds = torch.cat([prompt_embeds_1, prompt_embeds_2], dim=-1)
This merges both encoders into a single conditioning representation.
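The shape arithmetic can be demonstrated with dummy tensors standing in for the two encoders' outputs; the hidden sizes used here (768 for CLIP ViT-L and 1280 for OpenCLIP ViT-bigG) are SDXL's actual encoder dimensions:

```python
import torch

batch, seq_len = 1, 77                               # CLIP's standard token length
prompt_embeds_1 = torch.randn(batch, seq_len, 768)   # first text encoder output
prompt_embeds_2 = torch.randn(batch, seq_len, 1280)  # second text encoder output

# Concatenate along the feature dimension: 768 + 1280 = 2048 channels,
# the conditioning width the SDXL UNet expects.
prompt_embeds = torch.cat([prompt_embeds_1, prompt_embeds_2], dim=-1)
```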
Latent Encoding
Images are encoded into latent space using the VAE.
latents = vae.encode(images).latent_dist.sample()
The VAE compresses images before training the diffusion process.
This drastically reduces memory usage.
Noise Training
Diffusion training consists of predicting noise added to images.
noise = torch.randn_like(latents)
noisy_latents = scheduler.add_noise(latents, noise, timesteps)
The model learns to predict this noise.
Loss function:
loss = F.mse_loss(noise_pred.float(), noise.float())
This is the standard diffusion training loss.
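One optimization step can be sketched end to end with a toy stand-in for the UNet; any module mapping noisy latents to a same-shaped noise prediction works to illustrate the loss, and the simple additive noising below replaces the scheduler's `add_noise` purely for brevity:

```python
import torch
import torch.nn.functional as F

# Toy "UNet": a single conv keeping the 4-channel latent shape.
model = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

latents = torch.randn(2, 4, 16, 16)      # VAE latents (4 channels in SDXL)
noise = torch.randn_like(latents)
noisy_latents = latents + noise          # real code uses scheduler.add_noise

noise_pred = model(noisy_latents)        # predict the injected noise
loss = F.mse_loss(noise_pred.float(), noise.float())

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The real training loop differs only in scale: the model is the SDXL UNet, the noising follows the scheduler's timestep-dependent schedule, and the text embeddings are passed as conditioning.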
LoRA Configuration
LoRA modifies attention layers of the UNet.
LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
Key parameters:
Parameter        Description
r                rank of the low-rank adaptation matrices
lora_alpha       scaling factor applied to the LoRA update
target_modules   attention layers to adapt
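The math behind these parameters can be shown on a single linear layer (this is the LoRA update itself, not the peft API): the frozen weight W is augmented with a low-rank product scaled by alpha/r, and only the two small matrices are trained:

```python
import torch

d, r, alpha = 64, 8, 16
W = torch.randn(d, d)            # frozen base weight
A = torch.randn(r, d) * 0.01     # trainable down-projection (rank r)
B = torch.zeros(d, r)            # trainable up-projection, zero-initialized

def lora_forward(x):
    # Base layer plus the scaled low-rank update (alpha / r) * B @ A.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = torch.randn(1, d)
y = lora_forward(x)
```

Because B starts at zero, the adapted layer initially behaves exactly like the base layer, and A and B together hold far fewer parameters than W, which is why LoRA's training footprint is so small.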
Training Loop
Main training steps:
- Encode image into latent space
- Add noise
- Encode text prompt
- Predict noise with UNet
- Compute loss
- Backpropagate
The loop runs for multiple epochs:
for epoch in range(EPOCHS):
    for step, (images, captions) in enumerate(dataloader):
Saving the LoRA Model
After training:
unet.save_pretrained("sdxl_lora")
The LoRA weights can later be merged into the base model.
Image Generation Pipeline
The generation script loads:
- SDXL base model
- trained LoRA
- optimized scheduler
Key configuration:
pipe = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0"
)
Scheduler optimization:
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
The DPM-Solver++ multistep scheduler reaches comparable quality in far fewer denoising steps, which significantly accelerates generation.
Memory Optimization
To support Mac M‑series GPUs or limited VRAM:
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()
These techniques reduce peak memory usage.
Prompt Caching
Images are cached using a hash of the prompt.
hashlib.sha256(prompt.encode()).hexdigest()
This prevents regenerating identical images repeatedly.
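A hypothetical cache wrapper makes the idea concrete: the file name is derived from the prompt hash, so a repeated prompt returns the existing file instead of re-running the pipeline (the helper names and the `images/cache` default are illustrative):

```python
import hashlib
from pathlib import Path

def cached_path(prompt, cache_dir="images/cache"):
    """Deterministic file path derived from the prompt's SHA-256 hash."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    return Path(cache_dir) / f"{key}.png"

def generate_cached(prompt, generate_fn, cache_dir="images/cache"):
    """Generate only on a cache miss; generate_fn(prompt) must return an
    object with a .save(path) method, e.g. pipe(prompt).images[0]."""
    path = cached_path(prompt, cache_dir)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        generate_fn(prompt).save(path)
    return path
```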
Image Generation
Image generation call:
image = pipe(
prompt,
negative_prompt=NEGATIVE,
num_inference_steps=25,
guidance_scale=7,
height=1024,
width=1024
).images[0]
Important parameters:
Parameter             Meaning
num_inference_steps   number of denoising iterations
guidance_scale        how strongly the output follows the prompt (classifier-free guidance)
height / width        output image resolution
Deployment
Install Dependencies
pip install torch diffusers transformers peft accelerate pillow torchvision
Training
Run:
python train_diffusion.py
Training will produce:
sdxl_lora/
containing LoRA weights.
Running the Generator
python generate_image3.py
Example prompt:
clean enterprise illustration,
corporate presentation slide,
minimal design,
white background,
oracle integration architecture
Generated images are saved in:
images/cache/
Testing
You can test generation with different prompts:
Example:
enterprise microservices architecture diagram
or
cloud integration architecture with api gateway
Conclusion
Diffusion models are revolutionizing image generation by allowing AI systems to synthesize visual content from text descriptions.
Using Stable Diffusion XL combined with LoRA fine‑tuning enables:
- efficient training
- domain specialization
- enterprise use cases
- automated content generation
In practical systems, diffusion models can be integrated into larger pipelines such as:
- automated presentation builders
- documentation systems
- AI‑generated diagrams
- marketing automation platforms
With efficient techniques like LoRA, high‑quality image generation is now accessible even on consumer hardware such as:
- RTX GPUs
- Apple Silicon
- workstation GPUs
As diffusion architectures continue evolving, they will increasingly become a core component of AI‑driven content generation systems.
Acknowledgments
- Author - Cristiano Hoshikawa (Oracle LAD A-Team Solution Engineer)