Rethinking Layered Graphic Design Generation with a Top-Down Approach

Abstract

Accordion is a framework that converts AI-generated graphic designs into editable layered designs and replaces nonsensical AI-generated text with meaningful content guided by user prompts. Unlike bottom-up methods (e.g., COLE, Open-COLE) that build designs element by element, Accordion works top-down, using a reference image as global guidance for decomposing the design into layers. It employs a vision language model (VLM) in distinct roles across three stages and integrates vision experts such as SAM and inpainting models to extract objects, backgrounds, and text. Trained on the Design39K dataset (augmented with refined AI-generated samples), Accordion achieves strong results on the DesignIntention benchmark for tasks such as text-to-template, text addition, and text de-rendering, producing editable and visually coherent graphic designs.

Key Contributions 🎯

Novel Top-Down Approach: Unlike existing bottom-up methods that build designs element by element, Accordion takes a top-down approach by decomposing AI-generated designs into editable layers.
VLM-Powered Framework: Built around a vision language model that plays distinct roles across three carefully curated stages for layered design generation.
Text Refinement: Automatically refines nonsensical AI-generated text with meaningful alternatives guided by user prompts.
Vision Expert Integration: Leverages multiple vision experts including SAM and element removal models to facilitate high-quality layer creation.
Design39K Dataset: Introduces an augmented in-house graphic design dataset with AI-generated designs and refined ground truth.

Method Overview

Accordion operates in three main stages:

Stage 1: Reference Generation - Generate visually harmonious design images from user prompts using text-to-image diffusion models.
Stage 2: Layer Decomposition - Use a VLM to analyze the reference image and decompose it into constituent layers (text, objects, background), guided by specialized prompts.
Stage 3: Layer Refinement - Refine each layer using vision experts (SAM for segmentation, inpainting models for background extraction) and replace AI-generated text with meaningful content.

This top-down approach ensures global visual harmony while maintaining full editability of the final design.
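The three stages can be sketched as a small orchestration function. This is only an illustrative sketch, not the released implementation: the `Layer` and `LayeredDesign` types, the `run_pipeline` function, the box convention, the stacking order, and the `fake_*` stand-in stages are all hypothetical names and assumptions introduced here so the example runs without any model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels; assumed convention


@dataclass
class Layer:
    kind: str          # "background", "object", or "text"
    box: Box
    content: str = ""  # text string for text layers, asset name otherwise


@dataclass
class LayeredDesign:
    prompt: str
    layers: List[Layer] = field(default_factory=list)


# Back-to-front stacking order assumed for this sketch.
Z_ORDER = {"background": 0, "object": 1, "text": 2}


def run_pipeline(
    prompt: str,
    generate_reference: Callable[[str], object],
    decompose: Callable[[object, str], List[Layer]],
    refine: Callable[[Layer, object, str], Layer],
) -> LayeredDesign:
    reference = generate_reference(prompt)      # Stage 1: reference generation
    raw_layers = decompose(reference, prompt)   # Stage 2: layer decomposition
    refined = [refine(layer, reference, prompt) for layer in raw_layers]  # Stage 3
    refined.sort(key=lambda layer: Z_ORDER[layer.kind])
    return LayeredDesign(prompt=prompt, layers=refined)


# Stand-in stages (hypothetical) so the sketch runs without a diffusion model or VLM.
def fake_generate(prompt: str) -> object:
    return {"size": (1024, 1024), "prompt": prompt}


def fake_decompose(reference: object, prompt: str) -> List[Layer]:
    return [
        Layer("text", (100, 80, 800, 120), "SUMER SAALE"),  # garbled AI text
        Layer("object", (300, 300, 400, 400), "sun_icon"),
        Layer("background", (0, 0, 1024, 1024), "gradient"),
    ]


def fake_refine(layer: Layer, reference: object, prompt: str) -> Layer:
    if layer.kind == "text":
        # A real system would ask the VLM for text that fits the prompt.
        return Layer(layer.kind, layer.box, "Summer Sale")
    return layer


design = run_pipeline("poster for a summer sale", fake_generate, fake_decompose, fake_refine)
print([(layer.kind, layer.content) for layer in design.layers])
```

Passing the stages in as callables keeps the sketch independent of any particular diffusion model, VLM, or vision expert. The reference image is handed to every later stage, which is the sense in which it acts as global guidance.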

Figure 2 (Stage 2): Layer Decomposition – VLM analyzes and splits the reference into layers (text, objects, background).

Figure 3 (Stage 3): Layer Refinement – each layer is improved using vision experts such as SAM and inpainting for high-quality editable results.
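To make the role of segmentation and inpainting in Stage 3 concrete, the toy sketch below lifts a masked object out of a reference image into its own transparent layer and checks that stacking it over a filled-in background reproduces the reference. It is a simplified illustration under stated assumptions: the mask and the background are hand-written stand-ins for what a segmentation model and an inpainting model would produce, and `lift_object`, `over`, and `composite` are hypothetical helper names, not part of Accordion.

```python
from typing import List, Tuple

RGBA = Tuple[int, int, int, int]  # straight (non-premultiplied) alpha, 0-255
Image = List[List[RGBA]]
Mask = List[List[bool]]

TRANSPARENT: RGBA = (0, 0, 0, 0)


def lift_object(reference: Image, mask: Mask) -> Image:
    """Copy masked pixels into a new layer; all other pixels become transparent."""
    return [
        [pixel if keep else TRANSPARENT for pixel, keep in zip(row, mask_row)]
        for row, mask_row in zip(reference, mask)
    ]


def over(top: RGBA, bottom: RGBA) -> RGBA:
    """Source-over blend of one pixel onto an opaque bottom pixel."""
    alpha = top[3] / 255
    r, g, b = (round(t * alpha + u * (1 - alpha)) for t, u in zip(top[:3], bottom[:3]))
    return (r, g, b, 255)


def composite(layers: List[Image]) -> Image:
    """Flatten layers given back to front; the first layer must be opaque."""
    result = layers[0]
    for layer in layers[1:]:
        result = [
            [over(t, b) for t, b in zip(top_row, bottom_row)]
            for top_row, bottom_row in zip(layer, result)
        ]
    return result


RED: RGBA = (255, 0, 0, 255)
WHITE: RGBA = (255, 255, 255, 255)

reference = [[RED, WHITE], [WHITE, WHITE]]
mask = [[True, False], [False, False]]         # stand-in for a segmentation mask

object_layer = lift_object(reference, mask)
background = [[WHITE, WHITE], [WHITE, WHITE]]  # stand-in for an inpainted background

recomposed = composite([background, object_layer])
print(recomposed == reference)
```

Because each layer is stored separately, the object can later be moved or restyled without touching the background, which is what makes the decomposed design editable.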

Applications 🚀

Accordion excels in multiple graphic design tasks:

Text-to-Template: Generate fully editable design templates from text prompts
Adding Text to Background: Intelligently place and style text on existing backgrounds
Text De-rendering: Extract and replace text while preserving design aesthetics
Design Variations: Create multiple design variations while maintaining visual coherence

Why Top-Down? 🤔

Traditional bottom-up approaches (like COLE and Open-COLE) face several challenges:

Difficulty in maintaining global visual harmony when building designs element by element
Limited ability to ensure consistent color schemes and layout balance
Challenges in generating contextually appropriate text content

Our top-down approach addresses these issues by:

Starting with a visually complete reference that ensures global harmony
Using the reference as guidance for intelligent layer decomposition
Maintaining design coherence throughout the generation process