
Training Image Generation Models with Diffusion (Stable Diffusion XL + LoRA)

Introduction

Image generation using diffusion models has become one of the most transformative capabilities of modern artificial intelligence. These models generate high-quality images from natural language descriptions, enabling applications across many industries.

Real-world use cases include:

  • Marketing and advertising -- generating visual assets automatically.
  • Software documentation and presentations -- producing diagrams and illustrations for technical content.
  • Game development -- generating textures, characters, and environments.
  • Product design -- visualizing concepts before prototypes exist.
  • Enterprise automation -- generating architecture diagrams or slide illustrations automatically.

In enterprise environments, diffusion models can be integrated into automated pipelines. For example, a presentation generator can automatically produce slide images that represent architectural concepts such as:

  • API Gateways
  • Databases
  • Integration architectures
  • Cloud infrastructures

Instead of manually creating diagrams, an AI pipeline can generate them dynamically from structured prompts.

This tutorial explains how to train a Stable Diffusion XL model with LoRA fine-tuning, and how to deploy the model to generate images programmatically.


Technologies Involved

Diffusion Models

Diffusion models generate images by iteratively denoising random noise. The training process teaches a neural network how to reverse a noise process applied to real images.

The process works as follows:

  1. Start with a real image
  2. Gradually add noise until the image becomes pure noise
  3. Train a model to reverse this process
  4. During inference, start with noise and iteratively remove it

This allows the model to synthesize new images from text descriptions.
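The forward (noising) half of this process can be sketched with plain PyTorch. The linear beta schedule below is illustrative; real schedulers in the Diffusers library encapsulate this logic:

```python
import torch

# Illustrative DDPM-style linear beta schedule over T timesteps
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Jump directly to noise level t: x_t = sqrt(a_bar)*x0 + sqrt(1-a_bar)*eps."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

image = torch.rand(3, 64, 64) * 2 - 1       # fake image in [-1, 1]
slightly_noisy = add_noise(image, t=10)      # mostly recognizable
almost_pure_noise = add_noise(image, t=T - 1)  # nearly pure noise
```

At the final timestep the cumulative alpha is close to zero, so the image content is almost entirely replaced by noise; training teaches the network to run this process in reverse.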

Popular diffusion models include:

  • Stable Diffusion
  • Stable Diffusion XL (SDXL)
  • DALL·E
  • Imagen

In this tutorial we use Stable Diffusion XL, which provides:

  • Higher resolution
  • Better text understanding
  • Dual text encoders
  • Micro-conditioning

Stable Diffusion XL (SDXL)

SDXL is an advanced diffusion architecture that improves generation quality through:

  • Two text encoders
  • Improved conditioning
  • Higher resolution generation
  • Better prompt interpretation

Unlike earlier diffusion models, SDXL requires:

  • two tokenizers
  • two text encoders
  • pooled embeddings
  • time conditioning parameters

LoRA (Low-Rank Adaptation)

Training diffusion models from scratch is extremely expensive.

Instead, LoRA fine-tunes large models efficiently by training small low-rank matrices that modify the attention layers of the network.

Advantages:

  • Very small training footprint
  • Works with limited VRAM
  • Easy to merge into the base model
  • Fast training

In this project, LoRA is applied to the UNet attention layers.
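The core idea can be illustrated with plain tensors: instead of updating a full weight matrix W, LoRA learns two small matrices A and B whose product forms a low-rank update. The shapes below are illustrative, not SDXL's actual layer sizes:

```python
import torch

d_out, d_in, r, alpha = 640, 640, 8, 16

W = torch.randn(d_out, d_in)        # frozen base weight (not trained)
A = torch.randn(r, d_in) * 0.01     # trainable, tiny
B = torch.zeros(d_out, r)           # trainable, initialized to zero

# Effective weight at inference; B @ A has rank <= r
W_eff = W + (alpha / r) * (B @ A)

full_params = W.numel()
lora_params = A.numel() + B.numel()
print(f"trainable params: {lora_params} vs {full_params} "
      f"({100 * lora_params / full_params:.1f}%)")
```

Because B starts at zero, the adapted model initially behaves exactly like the base model; training then moves only A and B, here just 2.5% of the parameters of the full matrix.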


HuggingFace Diffusers

The Diffusers library provides a high-level API for working with diffusion models.

It includes:

  • pipelines
  • schedulers
  • training utilities
  • optimization helpers

Main components used:

  • StableDiffusionXLPipeline
  • DDPMScheduler
  • DPMSolverMultistepScheduler

PyTorch

PyTorch is used for:

  • training loops
  • GPU acceleration
  • tensor operations
  • neural network execution

Code Walkthrough

Dataset Structure

The training script expects a dataset structured as:

dataset/
   images/
       image1.png
       image2.png
   captions/
       image1.txt
       image2.txt

Each caption describes the image.

Example caption:

enterprise cloud architecture diagram with API gateway and database

Dataset Loader

The dataset loader reads images and their captions.

Key operations:

  • resizing images
  • converting to tensors
  • normalization

Important section:

transforms.Compose([
    transforms.Resize((IMAGE_SIZE, IMAGE_SIZE)),
    transforms.ToTensor(),
    transforms.Normalize([0.5,0.5,0.5],[0.5,0.5,0.5])
])

Normalization is important because diffusion models expect images in a [-1,1] range.


Prompt Encoding

SDXL uses two text encoders.

The function:

encode_prompt_sdxl()

performs:

  1. tokenization of captions
  2. embedding generation
  3. concatenation of embeddings

Important concept:

prompt_embeds = torch.cat([prompt_embeds_1, prompt_embeds_2], dim=-1)

This merges both encoders into a single conditioning representation.
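The effect of that concatenation can be seen with dummy tensors shaped like SDXL's encoder outputs (batch of 2 prompts, 77 tokens; hidden sizes 768 and 1280 for the two text encoders):

```python
import torch

batch, seq_len = 2, 77
prompt_embeds_1 = torch.randn(batch, seq_len, 768)   # text encoder 1 (CLIP ViT-L)
prompt_embeds_2 = torch.randn(batch, seq_len, 1280)  # text encoder 2 (OpenCLIP ViT-bigG)

# Concatenate along the hidden dimension -> one conditioning tensor
prompt_embeds = torch.cat([prompt_embeds_1, prompt_embeds_2], dim=-1)
print(prompt_embeds.shape)  # torch.Size([2, 77, 2048])

# The second encoder also yields a pooled embedding, which SDXL
# consumes through its extra conditioning inputs.
pooled_embeds = torch.randn(batch, 1280)
```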


Latent Encoding

Images are encoded into latent space using the VAE.

latents = vae.encode(images).latent_dist.sample()

The VAE compresses images before training the diffusion process.

This drastically reduces memory usage.
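The saving is easy to quantify: the SDXL VAE downsamples each spatial dimension by 8x and produces 4 latent channels, so a 1024x1024 RGB image becomes a 128x128x4 latent. A quick shape check (the real encoding, of course, requires the pretrained VAE):

```python
import torch

image = torch.randn(1, 3, 1024, 1024)              # pixel space
latents = torch.randn(1, 4, 1024 // 8, 1024 // 8)  # shape vae.encode() would yield

ratio = image.numel() / latents.numel()
print(f"latents shape: {tuple(latents.shape)}, compression: {ratio:.0f}x")
```

The diffusion process therefore operates on 48x less data than the raw pixels. In Diffusers, the sampled latents are additionally multiplied by vae.config.scaling_factor before training.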


Noise Training

Diffusion training consists of predicting noise added to images.

noise = torch.randn_like(latents)
noisy_latents = scheduler.add_noise(latents, noise, timesteps)

The model learns to predict this noise.

Loss function:

loss = F.mse_loss(noise_pred.float(), noise.float())

This is the standard diffusion training loss.
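Putting the noising and the loss together, a single training step can be sketched as follows. A tiny convolution stands in for the UNet so the snippet is self-contained; real training passes the timestep and prompt embeddings to the SDXL UNet and uses scheduler.add_noise for the weighting:

```python
import torch
import torch.nn.functional as F

# Stand-in "denoiser"; real code uses the conditioned SDXL UNet
model = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

latents = torch.randn(2, 4, 16, 16)   # batch of VAE latents
noise = torch.randn_like(latents)
timesteps = torch.randint(0, 1000, (2,))

# Simplified noising; schedulers derive this weight from the
# cumulative alphas at each sampled timestep
a_bar = torch.rand(2).view(-1, 1, 1, 1)
noisy_latents = a_bar.sqrt() * latents + (1 - a_bar).sqrt() * noise

noise_pred = model(noisy_latents)
loss = F.mse_loss(noise_pred.float(), noise.float())
loss.backward()
optimizer.step()
optimizer.zero_grad()
```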


LoRA Configuration

LoRA modifies attention layers of the UNet.

LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["to_q","to_k","to_v","to_out.0"]
)

Key parameters:

Parameter        Description

r                rank of the low-rank update matrices
lora_alpha       scaling factor applied to the update
target_modules   names of the attention layers to adapt


Training Loop

Main training steps:

  1. Encode image into latent space
  2. Add noise
  3. Encode text prompt
  4. Predict noise with UNet
  5. Compute loss
  6. Backpropagate

The loop runs for multiple epochs:

for epoch in range(EPOCHS):
    for step, (images, captions) in enumerate(dataloader):
        ...  # steps 1-6 above, applied to each batch

Saving the LoRA Model

After training:

unet.save_pretrained("sdxl_lora")

The LoRA weights can later be merged into the base model.


Image Generation Pipeline

The generation script loads:

  • SDXL base model
  • trained LoRA
  • optimized scheduler

Key configuration:

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0"
)

Scheduler optimization:

pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

This significantly accelerates generation.
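Putting the pieces together, loading the base model plus the trained adapter might look like this (dtype and device selection are illustrative; downloading the base model requires network access):

```python
import torch
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)
pipe.load_lora_weights("sdxl_lora")  # the adapter directory saved after training

# Swap in a faster multistep solver for fewer-step sampling
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

device = "cuda" if torch.cuda.is_available() else "cpu"  # or "mps" on Apple Silicon
pipe.to(device)
```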


Memory Optimization

To support Apple Silicon (M-series) GPUs or limited VRAM:

pipe.enable_attention_slicing()
pipe.enable_vae_slicing()

These techniques reduce peak memory usage.


Prompt Caching

Images are cached using a hash of the prompt.

hashlib.sha256(prompt.encode()).hexdigest()

This prevents regenerating identical images repeatedly.
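A small helper built around that hash might look like this (function names are illustrative):

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path("images/cache")

def cached_path(prompt: str) -> Path:
    """Deterministic filename derived from the prompt text."""
    digest = hashlib.sha256(prompt.encode()).hexdigest()
    return CACHE_DIR / f"{digest}.png"

def generate_or_reuse(prompt, generate):
    """Return a cached image if present, otherwise generate and cache it."""
    path = cached_path(prompt)
    if path.exists():
        return path                   # cache hit: skip the expensive call
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    image = generate(prompt)          # e.g. the pipeline call in the generator
    image.save(path)
    return path
```

Because SHA-256 is deterministic, the same prompt always maps to the same file, so repeated requests cost a single disk lookup instead of a full diffusion run.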


Image Generation

Image generation call:

image = pipe(
    prompt,
    negative_prompt=NEGATIVE,
    num_inference_steps=25,
    guidance_scale=7,
    height=1024,
    width=1024
).images[0]

Important parameters:

Parameter             Meaning

num_inference_steps   number of denoising iterations
guidance_scale        strength of prompt adherence (classifier-free guidance)
height / width        output image resolution


Deployment

Install Dependencies

pip install torch diffusers transformers peft accelerate pillow torchvision

Training

Run:

python train_diffusion.py

Training will produce:

sdxl_lora/

containing LoRA weights.


Running the Generator

python generate_image3.py

Example prompt:

clean enterprise illustration,
corporate presentation slide,
minimal design,
white background,
oracle integration architecture

Generated images are saved in:

images/cache/

Testing

You can test generation with different prompts:

Example:

enterprise microservices architecture diagram

or

cloud integration architecture with api gateway

Conclusion

Diffusion models are revolutionizing image generation by allowing AI systems to synthesize visual content from text descriptions.

Using Stable Diffusion XL combined with LoRA fine-tuning enables:

  • efficient training
  • domain specialization
  • enterprise use cases
  • automated content generation

In practical systems, diffusion models can be integrated into larger pipelines such as:

  • automated presentation builders
  • documentation systems
  • AI-generated diagrams
  • marketing automation platforms

With efficient techniques like LoRA, high-quality image generation is now accessible even on consumer hardware such as:

  • RTX GPUs
  • Apple Silicon
  • workstation GPUs

As diffusion architectures continue to evolve, they will increasingly become a core component of AI-driven content generation systems.

Acknowledgments

  • Author - Cristiano Hoshikawa (Oracle LAD A-Team Solution Engineer)