ComfyUI Workflow for AI Drama Production: Complete Guide
AI drama production is exploding on TikTok, YouTube Shorts, and Instagram Reels. But without a solid ComfyUI workflow, you're wasting hours on inconsistent characters, janky backgrounds, and outputs that don't match your script. This guide walks through a production pipeline for producing 9:16 vertical content at scale.
Why ComfyUI for AI Drama Production?
ComfyUI's node-based architecture gives you granular control over every stage of the pipeline—something Midjourney and DALL·E simply can't match. You need precise character consistency across scenes, controlled lighting, and reproducible workflows. ComfyUI delivers all three.
Key Advantages Over Other Tools
- Full pipeline control: Chain image generation, upscaling, and post-processing in one graph
- Model flexibility: Swap between SDXL, SD3.5, and Flux without leaving the UI
- Reproducibility: Save workflows as JSON files—load, run, done
- Cost efficiency: Local inference on a single RTX 3060 12GB produces 4-6 images per minute
Base Model Selection: Start with SDXL
SDXL 1.0 remains the sweet spot for AI drama production. It handles complex prompts with multiple characters and scene elements better than SD1.5, and it runs on consumer GPUs without the memory overhead of Flux. For a production pipeline, you'll want at minimum:
- Checkpoints: SDXL 1.0 base + a finetuned realistic model (Juggernaut XL or RealVisXL)
- VAE: sdxl_vae.safetensors (default), or the fp16-fix variant of the SDXL VAE if you decode in half precision
- LoRAs: Character-specific LoRAs (300-500 images, 10-16 epochs)
The Core Pipeline: 6 Essential Nodes
1. Text Encoding with Dual CLIP
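SDXL uses two CLIP text encoders: a large one (OpenCLIP ViT-bigG) and a standard one (CLIP ViT-L). Always route your positive and negative prompts through both. With an SDXL checkpoint, the standard CLIP Text Encode node already feeds both encoders, and the CLIPTextEncodeSDXL node exposes them as separate text_g and text_l fields. The dual encoding is a large part of SDXL's stronger prompt adherence.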
- Positive prompt: Detailed scene description with character references
- Negative prompt: "deformed, bad anatomy, disfigured, extra limbs, blurry, low quality, watermark, text"
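In ComfyUI's API (JSON) workflow format, the dual-encoder setup can be written out explicitly with the CLIPTextEncodeSDXL node. The sketch below sends the same text to both encoders. It assumes that node's input names as found in current ComfyUI, a CheckpointLoaderSimple node with the hypothetical id "4" (whose CLIP output is index 1), and hypothetical ids "6" and "7" for the prompt nodes. Check all of these against a workflow you export yourself.

```python
import json

# Hypothetical id of the CheckpointLoaderSimple node; its CLIP output is index 1.
CHECKPOINT_NODE = "4"

NEGATIVE = ("deformed, bad anatomy, disfigured, extra limbs, blurry, "
            "low quality, watermark, text")


def sdxl_text_node(text, width=1024, height=1024):
    """Build a CLIPTextEncodeSDXL node that sends the same text to both encoders."""
    return {
        "class_type": "CLIPTextEncodeSDXL",
        "inputs": {
            "clip": [CHECKPOINT_NODE, 1],  # [source node id, output index]
            "width": width,
            "height": height,
            "crop_w": 0,
            "crop_h": 0,
            "target_width": width,
            "target_height": height,
            "text_g": text,  # OpenCLIP ViT-bigG encoder
            "text_l": text,  # CLIP ViT-L encoder
        },
    }


def prompt_nodes(positive, negative=NEGATIVE):
    """Return positive ("6") and negative ("7") encoder nodes; ids are hypothetical."""
    return {"6": sdxl_text_node(positive), "7": sdxl_text_node(negative)}


if __name__ == "__main__":
    nodes = prompt_nodes("medium shot of a businessman in a navy suit, office")
    print(json.dumps(nodes, indent=2))
```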
2. KSampler with DPM++ 2M Karras
For consistent outputs across a scene sequence, lock your sampler settings:
- Sampler: dpmpp_2m (ComfyUI applies the Karras schedule through the separate scheduler setting)
- Scheduler: karras
- Steps: 25-30 (above 30 gives diminishing returns on SDXL)
- CFG scale: 7.0 (the sweet spot for creative freedom vs prompt adherence)
- Denoise: 1.0 for base generation, 0.5-0.7 for img2img refinement
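One way to keep these settings locked is to define them once and reuse them in every KSampler node you generate. The sketch below builds KSampler nodes in ComfyUI's API format. It uses 28 steps as one choice inside the 25-30 range above. The link values such as ["4", 0] are [node id, output index] pairs with hypothetical ids: "4" is the checkpoint, "6"/"7" are the prompts, "5" is an empty latent, and "10" is an upscaled latent.

```python
# Locked sampler settings shared by every shot in a scene sequence.
# In ComfyUI, sampler and scheduler are two separate KSampler inputs.
SAMPLER_SETTINGS = {
    "sampler_name": "dpmpp_2m",
    "scheduler": "karras",
    "steps": 28,
    "cfg": 7.0,
}


def ksampler_node(seed, model, positive, negative, latent, denoise=1.0):
    """Build an API-format KSampler node; links are [node_id, output_index]."""
    if not 0.0 < denoise <= 1.0:
        raise ValueError("denoise must be in (0, 1]")
    return {
        "class_type": "KSampler",
        "inputs": {
            **SAMPLER_SETTINGS,
            "seed": seed,
            "denoise": denoise,
            "model": model,
            "positive": positive,
            "negative": negative,
            "latent_image": latent,
        },
    }


# Base pass at full denoise, then an img2img refinement pass at 0.6.
base = ksampler_node(42, ["4", 0], ["6", 0], ["7", 0], ["5", 0])
refine = ksampler_node(42, ["4", 0], ["6", 0], ["7", 0], ["10", 0], denoise=0.6)
```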
3. Two-Stage Upscaling (Latent + 4x-UltraSharp)
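SDXL's default 1024×1024 output is square and does not fit a 9:16 vertical frame. Instead, generate at a vertical SDXL resolution such as 768×1344, then build a two-stage upscaler:
- First pass: Latent upscale by 1.5x using nearest-exact interpolation, followed by a second KSampler pass at 0.5-0.7 denoise to clean up latent-upscale artifacts
- Second pass: the 4x-UltraSharp ESRGAN model for pixel-level detail recovery
- Output resolution: downscale the 4x result and center-crop to 1080×1920 for TikTok (768×1344 is slightly wider than 9:16, so only a thin crop is needed)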
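A little arithmetic helps when sanity-checking the frame size at each stage. The sketch below assumes a 768×1344 vertical base, which is one of SDXL's standard aspect-ratio resolutions. It snaps latent sizes to multiples of 8, then fits the 4x result to 1080×1920 by scaling to cover the frame and center-cropping. plan_upscale is a hypothetical helper, not a ComfyUI node.

```python
def round8(x):
    """SDXL latents work in 8-pixel units, so snap sizes to multiples of 8."""
    return max(8, int(round(x / 8)) * 8)


def plan_upscale(width, height, latent_scale=1.5, model_scale=4,
                 target=(1080, 1920)):
    """Trace frame sizes through latent upscale, 4x model upscale, and final fit."""
    latent = (round8(width * latent_scale), round8(height * latent_scale))
    model = (latent[0] * model_scale, latent[1] * model_scale)
    tw, th = target
    # Scale so the image covers the target, then center-crop the overflow.
    scale = max(tw / model[0], th / model[1])
    resized = (max(tw, round(model[0] * scale)), max(th, round(model[1] * scale)))
    left = (resized[0] - tw) // 2
    top = (resized[1] - th) // 2
    return {
        "latent": latent,
        "model": model,
        "resized": resized,
        "crop_box": (left, top, left + tw, top + th),
        "final": (tw, th),
    }


vertical = plan_upscale(768, 1344)  # crop_box (8, 0, 1088, 1920)
square = plan_upscale(1024, 1024)   # crop_box (420, 0, 1500, 1920)
```

The comparison shows why the base resolution matters. A 768×1344 base loses only 17 pixels of width in the final crop. A 1024×1024 base throws away 840 pixels of width, which is close to half the frame.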
4. ControlNet for Scene Consistency
When you need specific poses or compositions across multiple frames, ControlNet is non-negotiable. The most useful models for drama production:
- OpenPose: Lock character poses from a reference frame
- Canny: Preserve composition structure when changing scene elements
- Depth: Maintain consistent scene layout across camera angle changes
- Control weight: 0.5-0.7 for creative control, 0.8-1.0 for strict reproduction
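In ComfyUI, the "weight" above is called strength. The sketch below builds an API-format ControlNetApplyAdvanced node and picks its strength from two presets. It assumes the core node's input names (positive, negative, control_net, image, strength, start_percent, end_percent); recent ComfyUI versions also accept an optional vae input, which is omitted here. The preset values are just points inside the ranges above, and the node ids in the example are hypothetical.

```python
# Strength presets taken from inside the ranges above (an arbitrary choice).
STRENGTH = {"creative": 0.6, "strict": 0.9}


def controlnet_node(mode, positive, negative, control_net, image,
                    start_percent=0.0, end_percent=1.0):
    """Build an API-format ControlNetApplyAdvanced node for one control image."""
    if mode not in STRENGTH:
        raise ValueError(f"unknown mode {mode!r}; use one of {sorted(STRENGTH)}")
    return {
        "class_type": "ControlNetApplyAdvanced",
        "inputs": {
            "positive": positive,
            "negative": negative,
            "control_net": control_net,  # e.g. an OpenPose, Canny, or Depth model
            "image": image,              # the preprocessed pose/edge/depth map
            "strength": STRENGTH[mode],
            "start_percent": start_percent,
            "end_percent": end_percent,
        },
    }


# Lock a pose from a reference frame (hypothetical node ids).
pose_lock = controlnet_node("strict", ["6", 0], ["7", 0], ["12", 0], ["13", 0])
```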
5. IPAdapter for Character Identity
The single biggest pain point in AI drama is character consistency. IPAdapter solves this. Use the plus face model with a reference image of your character:
- Input: One consistent reference portrait per character
- Weight: 0.6-0.8 (too high limits scene flexibility)
- Start at: 0.0 (applied from step 1)
- End at: 0.6-0.8 (let the scene prompt take over in later steps)
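To see what start_at and end_at mean in practice, map them linearly onto a 30-step run. With end_at = 0.7, the reference guides roughly the first 21 steps, and the scene prompt alone drives the last 9. The IPAdapter nodes map these fractions onto the sampler's noise schedule rather than onto step indices, so the hypothetical helper below is only an approximation.

```python
def active_steps(total_steps, start_at, end_at):
    """Approximate which sampler steps an IPAdapter with start_at/end_at touches.

    Linear step mapping only; ComfyUI works in noise-schedule terms, so the
    real boundary can shift by a step or so.
    """
    if not 0.0 <= start_at <= end_at <= 1.0:
        raise ValueError("need 0 <= start_at <= end_at <= 1")
    first = round(start_at * total_steps)
    last = round(end_at * total_steps)  # exclusive upper bound
    return range(first, last)


guided = active_steps(30, 0.0, 0.7)
print(f"IPAdapter active on steps {guided.start}-{guided.stop - 1} of 30")
```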
6. FaceDetailer Node
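Faces degrade during upscaling. The FaceDetailer node (from the Impact Pack custom nodes) detects faces after each upscale stage, crops them, and re-renders each face with a short inpainting pass:
- Detection model: a YOLOv8 nano face detector such as face_yolov8n.pt (fast enough to run on every frame)
- Restoration model (optional, applied in a separate face-restore node): GFPGAN v1.4 or CodeFormer (GFPGAN preserves more identity)
- Fidelity: 0.5-0.7 (CodeFormer's fidelity weight; match your character reference)
- Box expansion: 32px (prevents hard mask edges)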
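The 32px box expansion is simple geometry: grow each side of the detected face box, then clamp the result to the frame so a face near an edge doesn't produce an out-of-bounds crop. FaceDetailer has its own setting for this. The expand_box helper below is hypothetical and assumes boxes are (x1, y1, x2, y2) pixel tuples.

```python
def expand_box(box, pad, image_size):
    """Grow a face box by `pad` pixels per side, clamped to the image bounds.

    box is (x1, y1, x2, y2) in pixels; image_size is (width, height).
    """
    x1, y1, x2, y2 = box
    width, height = image_size
    return (
        max(0, x1 - pad),
        max(0, y1 - pad),
        min(width, x2 + pad),
        min(height, y2 + pad),
    )


# A face near the top edge of a 1080×1920 frame: the top side gets clamped.
print(expand_box((400, 10, 680, 330), 32, (1080, 1920)))  # → (368, 0, 712, 362)
```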
Character Consistency Workflow (The 3-Step Method)
Step 1: Character Sheet Generation
Before producing any scene, generate a character sheet at 1024×1536:
- Three poses: front, three-quarter, side profile
- Three expressions: neutral, happy, intense/serious
- Two outfits: primary and variant
- Seed locking: Generate with 6-8 different seeds, pick the best character interpretation, then lock that seed for every subsequent generation
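The sheet covers 3 poses × 3 expressions × 2 outfits, which comes to 18 variations. The sketch below enumerates them and draws a reproducible set of candidate seeds to audition. The pose, expression, and outfit strings are placeholders; in practice, replace them with your own outfit descriptions. The helper names are hypothetical.

```python
import itertools
import random

POSES = ["front view", "three-quarter view", "side profile"]
EXPRESSIONS = ["neutral expression", "happy expression", "intense serious expression"]
OUTFITS = ["primary outfit", "variant outfit"]  # placeholders for real descriptions


def sheet_prompts(character):
    """Every pose × expression × outfit combination for one character (3×3×2 = 18)."""
    return [
        f"character sheet, {character}, {pose}, {expr}, {outfit}"
        for pose, expr, outfit in itertools.product(POSES, EXPRESSIONS, OUTFITS)
    ]


def candidate_seeds(n=8, rng_seed=0):
    """Draw n distinct seeds to audition; keep the winner and reuse it everywhere."""
    rng = random.Random(rng_seed)
    return rng.sample(range(2**32), n)


prompts = sheet_prompts("Lin, a 35-year-old businessman")
seeds = candidate_seeds()
locked_seed = seeds[0]  # replace with whichever seed gave the best interpretation
```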
Step 2: IPAdapter Embedding Extraction
From your chosen character sheet image, extract the IPAdapter embedding once and cache it:
- Crop tightly to the face before extraction: the CLIP Vision encoder behind IPAdapter (ViT-H/14) works at 224×224, so any larger input is resized down before encoding
- Save to .npy file for reuse across all scenes
- This eliminates regeneration inconsistency—every scene uses the exact same character reference
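The "extract once, reuse everywhere" idea can be sketched as a cache keyed by a hash of the reference image. This way, every scene gets the identical embedding, and changing the reference file forces a fresh extraction. The sketch assumes a hypothetical encode_fn that returns a list of floats. It also stores the embedding as JSON so it runs with only the standard library; with NumPy installed, numpy.save and numpy.load would give the .npy files described above.

```python
import hashlib
import json
from pathlib import Path


def cached_embedding(reference_image, cache_dir, encode_fn):
    """Encode a reference image once, then reuse the stored embedding.

    encode_fn is a stand-in for whatever produces the IPAdapter embedding;
    it receives the image bytes and must return a list of floats. The cache
    key is the image's SHA-256, so swapping the reference forces a re-encode.
    """
    data = Path(reference_image).read_bytes()
    key = hashlib.sha256(data).hexdigest()
    cache_file = Path(cache_dir) / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    embedding = encode_fn(data)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(embedding))
    return embedding
```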
Step 3: Per-Scene Prompt Engineering
Each scene uses a template prompt structure:
- [Character description] — pulled from your character sheet metadata
- [Action/Pose] — specific to this scene
- [Expression] — matching the dialogue emotion
- [Setting] — scene background and props
- [Lighting] — consistent studio lighting across all scenes
- [Camera] — medium shot, close-up, etc.
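Keeping the template in code guarantees that every scene fills the same slots in the same order, and that the lighting text never drifts between shots. The ScenePrompt class below is a hypothetical helper. The camera-first ordering and the "photorealistic, 8k" suffix mirror the example prompt that follows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenePrompt:
    character: str   # pulled from the character sheet metadata
    action: str      # pose/action for this scene
    expression: str  # matches the dialogue emotion
    setting: str     # background and props
    lighting: str    # keep identical across all scenes
    camera: str      # medium shot, close-up, etc.

    def render(self, suffix="photorealistic, 8k"):
        """Join the slots in a fixed order, skipping any that are left empty."""
        parts = [self.camera, self.character, self.action, self.expression,
                 self.setting, self.lighting, suffix]
        return ", ".join(p for p in parts if p)


scene = ScenePrompt(
    character="a 35-year-old Chinese businessman in a tailored navy suit",
    action="standing by the window",
    expression="confident expression",
    setting="modern high-rise office overlooking the Shanghai skyline at golden hour",
    lighting="professional studio lighting, soft shadows",
    camera="medium shot",
)
print(scene.render())
```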
Example prompt:
"close-up shot of a 35-year-old Chinese businessman in a tailored navy suit, standing in a modern high-rise office overlooking Shanghai skyline at golden hour, professional studio lighting, soft shadows, shallow depth of field, photorealistic, 8k"
Performance Optimization for Local GPUs
Running on consumer hardware? These settings squeeze maximum throughput:
- Batch size: 2-4 images per generation (RTX 3060 can handle batch 2 at 1024×1024)
- VAE tiling: Enable for resolutions above 1024×1024
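- Model offloading: when a model doesn't fit in VRAM (e.g. models over 6GB on smaller cards), launch ComfyUI with --lowvram
- Smart memory: on by default in ComfyUI (it keeps models cached in VRAM between runs); only launch with --disable-smart-memory if you keep hitting out-of-memory errors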
- XFormers: Install for memory-efficient attention (saves 15-20% VRAM)
- Expected throughput: 8-10 images per minute at 768×768, 4-6 at 1024×1024
Production-Ready Folder Structure
Organization is the difference between a demo and a production pipeline. Use this structure:
- /workflows/ — One JSON per scene (character generation, scene generation, upscale)
- /characters/[name]/ — Character sheets, IPAdapter embeddings, LoRA files
- /scenes/[episode]/[scene-number]/ — Raw outputs organized by episode and sequence
- /upscaled/ — Final 1080×1920 rendered frames ready for video assembly
- /references/ — Mood boards, lighting references, style frames
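The layout can be created with a short script so that every project starts out identical. The sketch below is a hypothetical helper. It zero-pads scene numbers (001, 002, ...) so folders sort correctly; that padding is a choice made here, not part of the structure above.

```python
from pathlib import Path


def make_project(root, characters, episodes):
    """Create the production folder layout and return the directories made.

    characters is a list of character names; episodes maps an episode
    name to its number of scenes.
    """
    root = Path(root)
    dirs = [root / "workflows", root / "upscaled", root / "references"]
    dirs += [root / "characters" / name for name in characters]
    for episode, scene_count in episodes.items():
        dirs += [root / "scenes" / episode / f"{n:03d}"
                 for n in range(1, scene_count + 1)]
    for d in dirs:
        d.mkdir(parents=True, exist_ok=True)  # safe to re-run
    return dirs
```

Example: make_project("drama", ["lin", "mei"], {"ep01": 12}) creates one folder per character and scenes/ep01/001 through scenes/ep01/012.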
Common Pitfalls and Fixes
- Inconsistent character faces: your IPAdapter weight is probably too low. Bump it to 0.7, and make sure your reference image is evenly lit so scene-specific lighting doesn't get baked into the character
- Background bleeding into characters: increase CFG to 7.5-8.0, or add a Depth ControlNet (fed by a depth preprocessor) at weight 0.4
- Bloated workflow graphs: Group related nodes (ControlNet, FaceDetailer, Upscale) into collapsed node groups. Your entire pipeline shouldn't exceed 40-50 visible nodes
- Out-of-memory errors: lower batch size to 1, switch to the tiled VAE Decode node, and if that isn't enough, launch ComfyUI with --lowvram
Final Tips for Production Speed
- Pre-generate all backgrounds separately using img2img from blank scenes—then composite in post
- Export workflows in ComfyUI's API (JSON) format and queue them over its HTTP API for batch generation on a headless server
- Automate your LoRA training with Kohya SS; train once, reuse across every episode
- Aim for 15-20 seconds of screen time per 2-hour production session, which works out to about 90-120 frames at 1080×1920 (roughly 6 frames per second of screen time)
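For headless batch runs, an API-format workflow can be queued by POSTing it to a running ComfyUI server. The sketch below follows the pattern of the API example script that ships with ComfyUI: a POST to /prompt with a body of the form {"prompt": workflow}. It uses only the standard library. The server address is ComfyUI's default local one, and the KSampler node id you pass in is hypothetical, so take it from your own exported JSON.

```python
import copy
import json
import urllib.request

COMFY_URL = "http://127.0.0.1:8188"  # default local address; adjust for your server


def build_payload(workflow, seed, ksampler_id, client_id="batch-runner"):
    """Copy an API-format workflow and set a new seed on one KSampler node."""
    wf = copy.deepcopy(workflow)  # leave the caller's template untouched
    wf[ksampler_id]["inputs"]["seed"] = seed
    return {"prompt": wf, "client_id": client_id}


def queue_prompt(payload, url=COMFY_URL):
    """POST the payload to ComfyUI's /prompt endpoint and return its JSON reply."""
    req = urllib.request.Request(
        f"{url}/prompt",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

To use it, load an exported API workflow (for example, a hypothetical workflows/scene_api.json) with json.load. Then call queue_prompt(build_payload(workflow, seed, "3")) once per seed. The server queues the jobs and writes the images to its output folder.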
Building a reliable ComfyUI pipeline for AI drama production takes upfront investment, but once the workflow is locked, you're producing consistent, studio-quality frames at 4-6 images per minute on consumer hardware. That's fast enough for real production.
Ready to create your AI drama series? Contact AI Drama Studio for a free consultation.