
GLIGEN: Open-Set Grounded Text-to-Image Generation

GLIGEN: Open-Set Grounded Text-to-Image Generation, by Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. This work proposes GLIGEN, an approach that extends pre-trained text-to-image diffusion models with additional grounding inputs. Given a caption together with bounding boxes, the model achieves open-world grounded text-to-image generation, outperforming existing supervised layout-to-image baselines. GLIGEN injects new Gated Self-Attention layers into the frozen pre-trained model to fuse grounding tokens with visual tokens, and supports scheduled sampling at inference, where grounding is applied only during the early denoising steps. Because the original weights stay frozen and only the new layers are trained, continual pre-training on large grounding data remains cost-efficient. Overall, GLIGEN adds versatile grounding capabilities to text-to-image generation.
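To make the gated-fusion idea concrete, here is a minimal NumPy sketch of a gated self-attention step in the spirit of GLIGEN: visual tokens and grounding tokens attend jointly, only the visual positions are kept, and the result is added back through a learnable tanh gate. The function names, the single-head attention with identity projections, and the token shapes are simplifying assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, d):
    # Toy single-head attention with identity Q/K/V projections,
    # just to show where the joint attention happens.
    scores = tokens @ tokens.T / np.sqrt(d)
    return softmax(scores) @ tokens

def gated_self_attention(visual, grounding, gamma):
    # Concatenate visual + grounding tokens, attend over the joint
    # sequence, keep only the visual positions, then add the result
    # back through a tanh(gamma) gate.
    joint = np.concatenate([visual, grounding], axis=0)
    attended = self_attention(joint, visual.shape[1])
    visual_out = attended[: visual.shape[0]]
    return visual + np.tanh(gamma) * visual_out

rng = np.random.default_rng(0)
v = rng.standard_normal((4, 8))  # 4 visual tokens, dim 8 (hypothetical sizes)
g = rng.standard_normal((2, 8))  # 2 grounding tokens (e.g. box + phrase features)

# With gamma = 0 the gate is closed, so the layer is an identity map:
# this is what lets the new layers be inserted without disturbing
# the frozen pre-trained model at the start of training.
assert np.allclose(gated_self_attention(v, g, 0.0), v)
```

The gate initialized at zero is the key design choice: the augmented model starts out exactly equal to the pre-trained one, and grounding influence grows only as the gate is learned.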