first commit

2026-03-07 07:13:48 -03:00
commit 350e2f9a30
3 changed files with 728 additions and 0 deletions

README.md Normal file

@@ -0,0 +1,441 @@
# Training Image Generation Models with Diffusion (Stable Diffusion XL + LoRA)
## Introduction
Image generation with diffusion models has become one of the most
transformative capabilities of modern artificial intelligence. These
models can generate high-quality images from natural language
descriptions, enabling applications across many industries.

Real-world use cases include:
- **Marketing and advertising** -- generating visual assets
automatically.
- **Software documentation and presentations** -- producing diagrams
and illustrations for technical content.
- **Game development** -- generating textures, characters, and
environments.
- **Product design** -- visualizing concepts before prototypes exist.
- **Enterprise automation** -- generating architecture diagrams or
slide illustrations automatically.

In enterprise environments, diffusion models can be integrated into
automated pipelines. For example, a presentation generator can
automatically produce slide images that represent architectural concepts
such as:
- API Gateways
- Databases
- Integration architectures
- Cloud infrastructures

Instead of manually creating diagrams, an AI pipeline can generate them
dynamically from structured prompts.
This tutorial explains how to train a **Stable Diffusion XL model with
LoRA fine-tuning** and how to deploy it to generate images
programmatically.
------------------------------------------------------------------------
# Technologies Involved
## Diffusion Models
Diffusion models generate images by **iteratively denoising random
noise**. The training process teaches a neural network how to reverse a
noise process applied to real images.
The process works as follows:
1. Start with a real image
2. Gradually add noise until the image becomes pure noise
3. Train a model to reverse this process
4. During inference, start with noise and iteratively remove it

This allows the model to synthesize new images from text descriptions.
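The forward noising step above has a convenient closed form: given a noise schedule, the noisy image at any step t can be sampled directly from the original image. A minimal plain-Python sketch, assuming a linear DDPM-style beta schedule (an illustration of the math, not the Diffusers implementation):

``` python
import math

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    # Linear beta schedule; alpha_bar_t is the cumulative product of
    # (1 - beta_s) for s <= t, and shrinks toward 0 as t grows.
    betas = [beta_start + (beta_end - beta_start) * t / (num_steps - 1)
             for t in range(num_steps)]
    alpha_bars, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def add_noise(x0, eps, alpha_bar_t):
    # Closed-form forward process for a single pixel/latent value:
    # x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    return math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps

alpha_bars = make_alpha_bars()
# Early steps keep the image almost intact; late steps are almost pure noise.
print(round(alpha_bars[0], 4), round(alpha_bars[-1], 6))
```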
Popular diffusion models include:
- Stable Diffusion
- Stable Diffusion XL (SDXL)
- DALL·E
- Imagen

In this tutorial we use **Stable Diffusion XL**, which provides:
- Higher resolution
- Better text understanding
- Dual text encoders
- Micro-conditioning
------------------------------------------------------------------------
## Stable Diffusion XL (SDXL)
SDXL is an advanced diffusion architecture that improves generation
quality through:
- **Two text encoders**
- **Improved conditioning**
- **Higher resolution generation**
- **Better prompt interpretation**

Unlike earlier diffusion models, SDXL requires:
- two tokenizers
- two text encoders
- pooled embeddings
- time conditioning parameters
------------------------------------------------------------------------
## LoRA (Low-Rank Adaptation)
Training diffusion models from scratch is extremely expensive.
Instead, **LoRA** fine-tunes large models efficiently by training small
low-rank matrices that modify the attention layers of the network.

Advantages:
- Very small training footprint
- Works with limited VRAM
- Easy to merge into the base model
- Fast training

In this project, LoRA is applied to the **UNet attention layers**.
------------------------------------------------------------------------
## Hugging Face Diffusers
The **Diffusers library** provides a high-level API for working with
diffusion models.
It includes:
- pipelines
- schedulers
- training utilities
- optimization helpers

Main components used:
- `StableDiffusionXLPipeline`
- `DDPMScheduler`
- `DPMSolverMultistepScheduler`
------------------------------------------------------------------------
## PyTorch
PyTorch is used for:
- training loops
- GPU acceleration
- tensor operations
- neural network execution
------------------------------------------------------------------------
# Code Walkthrough
## Dataset Structure
The training script expects a dataset structured as:

```
dataset/
    images/
        image1.png
        image2.png
    captions/
        image1.txt
        image2.txt
```

Each caption describes its image. Example caption:

```
enterprise cloud architecture diagram with API gateway and database
```
------------------------------------------------------------------------
# Dataset Loader
The dataset loader reads images and their captions.

Key operations:
- resizing images
- converting to tensors
- normalization

Important section:
``` python
transforms.Compose([
    transforms.Resize((IMAGE_SIZE, IMAGE_SIZE)),
    transforms.ToTensor(),
    transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
])
```
Normalization is important because diffusion models expect images in a
**\[-1,1\] range**.
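As a quick sanity check, `Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])` is just `(x - 0.5) / 0.5` per channel, which maps `ToTensor`'s `[0, 1]` output onto `[-1, 1]`. A plain-Python sketch of the arithmetic (not the torchvision code itself):

``` python
def normalize(pixel, mean=0.5, std=0.5):
    # Same arithmetic as transforms.Normalize: (x - mean) / std.
    # With mean = std = 0.5 this maps [0, 1] onto [-1, 1].
    return (pixel - mean) / std

print(normalize(0.0), normalize(0.5), normalize(1.0))  # -1.0 0.0 1.0
```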
------------------------------------------------------------------------
# Prompt Encoding
SDXL uses **two text encoders**.

The function `encode_prompt_sdxl()` performs:
1. tokenization of captions
2. embedding generation
3. concatenation of embeddings

Important concept:

``` python
prompt_embeds = torch.cat([prompt_embeds_1, prompt_embeds_2], dim=-1)
```

This merges both encoders into a single conditioning representation.
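`dim=-1` concatenates along the feature axis, so the token count stays the same while the embedding widths add up; for SDXL's encoders (768-dim CLIP ViT-L plus 1280-dim OpenCLIP ViT-bigG) this yields 2048-dim conditioning. A list-based sketch of the same `torch.cat` semantics (the 0.1/0.2 values are hypothetical):

``` python
def cat_last_dim(a, b):
    # Mimics torch.cat([a, b], dim=-1) for [token][feature] nested lists:
    # each token keeps its row, and the feature vectors are joined.
    return [row_a + row_b for row_a, row_b in zip(a, b)]

emb_1 = [[0.1] * 768]   # one token from the first encoder (hypothetical values)
emb_2 = [[0.2] * 1280]  # the same token from the second encoder
merged = cat_last_dim(emb_1, emb_2)
print(len(merged), len(merged[0]))  # 1 2048
```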
------------------------------------------------------------------------
# Latent Encoding
Images are encoded into latent space using the **VAE**:

``` python
latents = vae.encode(images).latent_dist.sample()
```

The VAE compresses images before the diffusion process is trained,
which drastically reduces memory usage.
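The saving is easy to quantify: the SDXL VAE downsamples each spatial dimension by a factor of 8 and stores 4 latent channels, so a 1024x1024 RGB image becomes a 128x128x4 latent. A back-of-the-envelope check:

``` python
def latent_compression(height, width, rgb_channels=3,
                       downsample=8, latent_channels=4):
    # Ratio of raw pixel values to latent values for the SD/SDXL VAE.
    pixels = height * width * rgb_channels
    latents = (height // downsample) * (width // downsample) * latent_channels
    return pixels / latents

print(latent_compression(1024, 1024))  # 48.0 -> the UNet sees 48x fewer values
```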
------------------------------------------------------------------------
# Noise Training
Diffusion training consists of predicting the noise added to images:

``` python
noise = torch.randn_like(latents)
noisy_latents = scheduler.add_noise(latents, noise, timesteps)
```

The model learns to predict this noise. Loss function:

``` python
loss = F.mse_loss(noise_pred.float(), noise.float())
```

This is the standard diffusion training loss.
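The objective itself is nothing exotic: the UNet's output is compared element-wise to the noise that was actually added, averaged into a single scalar. A plain-Python sketch of the epsilon-prediction MSE (standing in for `F.mse_loss`, not the PyTorch implementation):

``` python
def mse_loss(pred, target):
    # Mean squared error between predicted and true noise values.
    assert len(pred) == len(target)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# A perfect prediction scores 0; errors grow quadratically.
print(mse_loss([0.5, -0.2], [0.5, -0.2]), mse_loss([0.0, 0.0], [1.0, 1.0]))
```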
------------------------------------------------------------------------
# LoRA Configuration
LoRA modifies the attention layers of the UNet:

``` python
LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"]
)
```

Key parameters:

| Parameter        | Description               |
|------------------|---------------------------|
| `r`              | rank of adaptation        |
| `lora_alpha`     | scaling factor            |
| `target_modules` | attention layers to adapt |
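The efficiency gain is easy to see from a parameter count: instead of updating a full `d_in x d_out` weight matrix, LoRA trains two low-rank factors `A (d_in x r)` and `B (r x d_out)`. For a hypothetical 1280-dim attention projection with `r=8`:

``` python
def lora_params(d_in, d_out, r):
    # Full fine-tuning would update d_in * d_out weights; LoRA only
    # trains the low-rank factors A (d_in x r) and B (r x d_out).
    full = d_in * d_out
    lora = d_in * r + r * d_out
    return full, lora

full, lora = lora_params(1280, 1280, r=8)
print(full, lora, full // lora)  # 1638400 20480 80 -> ~80x fewer trainable weights
```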
------------------------------------------------------------------------
# Training Loop
Main training steps:
1. Encode image into latent space
2. Add noise
3. Encode text prompt
4. Predict noise with UNet
5. Compute loss
6. Backpropagate

The loop runs for multiple epochs:

``` python
for epoch in range(EPOCHS):
    for step, (images, captions) in enumerate(dataloader):
```
------------------------------------------------------------------------
# Saving the LoRA Model
After training:

``` python
unet.save_pretrained("sdxl_lora")
```

The LoRA weights can later be merged into the base model.
------------------------------------------------------------------------
# Image Generation Pipeline
The generation script loads:
- SDXL base model
- trained LoRA
- optimized scheduler

Key configuration:

``` python
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0"
)
```

Scheduler optimization:

``` python
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
```

This significantly accelerates generation.
------------------------------------------------------------------------
# Memory Optimization
To support Apple M-series GPUs or cards with limited VRAM:

``` python
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()
```

These techniques reduce peak memory usage.
------------------------------------------------------------------------
# Prompt Caching
Images are cached using a hash of the prompt:

``` python
hashlib.sha256(prompt.encode()).hexdigest()
```

This prevents regenerating identical images repeatedly.
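Because SHA-256 is deterministic, the same prompt always maps to the same filename, so a repeated request becomes a cheap file lookup instead of a full diffusion run. A minimal sketch of the cache-key scheme (the 16-character truncation mirrors the generator script):

``` python
import hashlib
from pathlib import Path

def cached_path(prompt: str, cache_dir: Path = Path("images/cache")) -> Path:
    # Deterministic key: identical prompts hash to identical filenames.
    key = hashlib.sha256(prompt.encode()).hexdigest()[:16]
    return cache_dir / f"{key}.png"

a = cached_path("enterprise architecture diagram")
b = cached_path("enterprise architecture diagram")
c = cached_path("cloud integration diagram")
print(a == b, a == c)  # True False
```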
------------------------------------------------------------------------
# Image Generation
Image generation call:

``` python
image = pipe(
    prompt,
    negative_prompt=NEGATIVE,
    num_inference_steps=25,
    guidance_scale=7,
    height=1024,
    width=1024
).images[0]
```

Important parameters:

| Parameter             | Meaning              |
|-----------------------|----------------------|
| `num_inference_steps` | diffusion iterations |
| `guidance_scale`      | prompt strength      |
| `height` / `width`    | image resolution     |
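`guidance_scale` controls classifier-free guidance: at each denoising step the pipeline makes two noise predictions, one conditioned on the prompt and one unconditional (or on the negative prompt), and extrapolates away from the unconditional one. A scalar sketch of the combination rule:

``` python
def apply_guidance(uncond, cond, scale):
    # Classifier-free guidance: uncond + scale * (cond - uncond).
    # scale = 1 reproduces the conditional prediction; larger values
    # push the result further toward the prompt.
    return uncond + scale * (cond - uncond)

print(apply_guidance(0.0, 1.0, 1.0), apply_guidance(0.0, 1.0, 7.0))  # 1.0 7.0
```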
------------------------------------------------------------------------
# Deployment
## Install Dependencies
``` shell
pip install torch diffusers transformers peft accelerate pillow torchvision
```
------------------------------------------------------------------------
# Training
Run:

``` shell
python train_diffusion.py
```

Training will produce a `sdxl_lora/` directory containing the LoRA
weights.
------------------------------------------------------------------------
# Running the Generator
``` shell
python generate_image3.py
```

Example prompt:

```
clean enterprise illustration,
corporate presentation slide,
minimal design,
white background,
oracle integration architecture
```

Generated images are saved in `images/cache/`.
------------------------------------------------------------------------
# Testing
You can test generation with different prompts, for example:

```
enterprise microservices architecture diagram
```

or

```
cloud integration architecture with api gateway
```
------------------------------------------------------------------------
# Conclusion
Diffusion models are revolutionizing image generation by allowing AI
systems to synthesize visual content from text descriptions.
Using Stable Diffusion XL combined with LoRA fine-tuning enables:
- efficient training
- domain specialization
- enterprise use cases
- automated content generation

In practical systems, diffusion models can be integrated into larger
pipelines such as:
- automated presentation builders
- documentation systems
- AI-generated diagrams
- marketing automation platforms

With efficient techniques like LoRA, high-quality image generation is
now accessible even on consumer hardware such as:
- RTX GPUs
- Apple Silicon
- workstation GPUs

As diffusion architectures continue to evolve, they will increasingly
become a core component of AI-driven content generation systems.

# Acknowledgments
- **Author** - Cristiano Hoshikawa (Oracle LAD A-Team Solution Engineer)

generate_image3.py Normal file

@@ -0,0 +1,69 @@
import hashlib
from pathlib import Path

import torch
from diffusers import StableDiffusionXLPipeline
from peft import PeftModel

CACHE_DIR = Path("images/cache")
CACHE_DIR.mkdir(parents=True, exist_ok=True)

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float32
).to("cuda")
pipe.vae.to(torch.float32)

# Load the trained LoRA adapter and merge it into the base UNet
pipe.unet = PeftModel.from_pretrained(pipe.unet, "sdxl_lora")
pipe.unet = pipe.unet.merge_and_unload()
pipe.unet.eval()

NEGATIVE = """
text,
watermark,
people,
photo,
photorealistic,
complex background
"""

def _hash_prompt(prompt: str) -> str:
    return hashlib.sha256(prompt.encode()).hexdigest()[:16]

def generate_slide_image(prompt: str) -> str:
    h = _hash_prompt(prompt)
    image_path = CACHE_DIR / f"{h}.png"
    if image_path.exists():
        print("Image cache hit")
        return str(image_path)
    print("Generating image...")
    image = pipe(
        prompt,
        negative_prompt=NEGATIVE,
        num_inference_steps=30,
        guidance_scale=7,
        height=1024,
        width=1024
    ).images[0]
    image.save(image_path)
    return str(image_path)

if __name__ == "__main__":
    prompt = """
    clean enterprise illustration,
    corporate presentation slide,
    minimal design,
    white background,
    blue gradient accents,
    expose oracle integration connected to autonomous database and exposing through api gateway
    """
    generate_slide_image(prompt)

train_diffusion.py Normal file

@@ -0,0 +1,218 @@
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from PIL import Image
from pathlib import Path
from diffusers import StableDiffusionXLPipeline, DDPMScheduler
from peft import LoraConfig, get_peft_model
# =========================
# CONFIG
# =========================
IMAGES_DIR = Path("dataset/images")
CAPTIONS_DIR = Path("dataset/captions")
OUTPUT_DIR = Path("sdxl_lora")
IMAGE_SIZE = 1024
BATCH_SIZE = 2
EPOCHS = 10
LR = 1e-4
DEVICE = "cuda"
# =========================
# DATASET
# =========================
class ImageCaptionDataset(Dataset):
    def __init__(self):
        self.images = sorted([p for p in IMAGES_DIR.glob("*") if p.suffix.lower() in [".png", ".jpg", ".jpeg", ".webp"]])
        self.transform = transforms.Compose([
            transforms.Resize((IMAGE_SIZE, IMAGE_SIZE)),
            transforms.ToTensor(),
            # RGB: normalize all 3 channels
            transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
        ])

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img_path = self.images[idx]
        caption_path = CAPTIONS_DIR / (img_path.stem + ".txt")
        if not caption_path.exists():
            raise FileNotFoundError(f"Caption not found: {caption_path}")
        image = Image.open(img_path).convert("RGB")
        image = self.transform(image)
        caption = caption_path.read_text(encoding="utf-8").strip()
        return image, caption

dataset = ImageCaptionDataset()
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=2, pin_memory=True)
# =========================
# LOAD SDXL
# =========================
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float32,
    use_safetensors=True,
).to(DEVICE)

# The VAE needs FP32 for numerical stability
pipe.vae.to(dtype=torch.float32)
# UNet in FP16
pipe.unet.to(dtype=torch.float16)
# Text encoders in FP16
pipe.text_encoder.to(dtype=torch.float16)
pipe.text_encoder_2.to(dtype=torch.float16)
# pipe.enable_xformers_memory_efficient_attention()
tokenizer_1 = pipe.tokenizer
tokenizer_2 = pipe.tokenizer_2
text_encoder_1 = pipe.text_encoder
text_encoder_2 = pipe.text_encoder_2
vae = pipe.vae
unet = pipe.unet
scheduler = DDPMScheduler.from_config(pipe.scheduler.config)
# Freeze the VAE and text encoders (only the LoRA on the UNet is trained)
vae.eval()
vae.requires_grad_(False)
text_encoder_1.eval()
text_encoder_1.requires_grad_(False)
text_encoder_2.eval()
text_encoder_2.requires_grad_(False)
# =========================
# LoRA CONFIG (UNet)
# =========================
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
    lora_dropout=0.1,
    bias="none",
)
unet = get_peft_model(unet, lora_config)
unet.train()
optimizer = torch.optim.AdamW(unet.parameters(), lr=LR)
# =========================
# Helpers SDXL conditioning
# =========================
def encode_prompt_sdxl(captions):
    # encoder 1
    inputs_1 = tokenizer_1(
        captions,
        padding="max_length",
        truncation=True,
        max_length=tokenizer_1.model_max_length,
        return_tensors="pt",
    )
    # encoder 2
    inputs_2 = tokenizer_2(
        captions,
        padding="max_length",
        truncation=True,
        max_length=tokenizer_2.model_max_length,
        return_tensors="pt",
    )
    input_ids_1 = inputs_1.input_ids.to(DEVICE)
    input_ids_2 = inputs_2.input_ids.to(DEVICE)
    with torch.no_grad():
        out_1 = text_encoder_1(input_ids_1, output_hidden_states=True)
        out_2 = text_encoder_2(input_ids_2, output_hidden_states=True)
    # SDXL uses the penultimate hidden layer as token embeddings
    prompt_embeds_1 = out_1.hidden_states[-2]
    prompt_embeds_2 = out_2.hidden_states[-2]
    # concatenate the embeddings of both encoders along the feature axis
    prompt_embeds = torch.cat([prompt_embeds_1, prompt_embeds_2], dim=-1)
    # pooled embedding: normally comes from text_encoder_2
    pooled_prompt_embeds = out_2.hidden_states[-1][:, 0]
    return prompt_embeds, pooled_prompt_embeds
def make_time_ids(batch_size):
    # SDXL micro-conditioning:
    # [orig_h, orig_w, crop_y, crop_x, target_h, target_w]
    return torch.tensor(
        [[IMAGE_SIZE, IMAGE_SIZE, 0, 0, IMAGE_SIZE, IMAGE_SIZE]],
        device=DEVICE,
        dtype=torch.float16,
    ).repeat(batch_size, 1)
# =========================
# TRAIN
# =========================
for epoch in range(EPOCHS):
    for step, (images, captions) in enumerate(dataloader):
        # The VAE runs in FP32, so keep the input images in FP32 as well
        images = images.to(DEVICE, dtype=torch.float32)

        # VAE encode -> latents
        with torch.no_grad():
            latents = vae.encode(images).latent_dist.sample()
            latents = latents * vae.config.scaling_factor
        latents = latents.to(torch.float16)

        # noise + timestep
        noise = torch.randn_like(latents)
        timesteps = torch.randint(
            0,
            scheduler.config.num_train_timesteps,
            (latents.shape[0],),
            device=DEVICE,
        ).long()
        noisy_latents = scheduler.add_noise(latents, noise, timesteps)

        # SDXL conditioning
        prompt_embeds, pooled_prompt_embeds = encode_prompt_sdxl(captions)
        time_ids = make_time_ids(latents.shape[0])

        # UNet predicts the noise
        noise_pred = unet(
            noisy_latents,
            timesteps,
            encoder_hidden_states=prompt_embeds,
            added_cond_kwargs={
                "text_embeds": pooled_prompt_embeds,
                "time_ids": time_ids,
            },
        ).sample

        loss = F.mse_loss(noise_pred.float(), noise.float())

        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

        if step % 10 == 0:
            print(f"Epoch {epoch} Step {step} Loss {loss.item():.4f}")
# =========================
# SAVE LoRA
# =========================
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
unet.save_pretrained(OUTPUT_DIR)
print("LoRA saved correctly.")