SD XL has been suffering from CLIP. I think this is true, at least partially. Recent models have shown some potential for natural language, such as understanding "left is red, right is blue". However, since CLIP was not trained on natural language sentences, base SD XL and its finetuned variants were significantly limited in processing them.
Illustrious XL 3.0–3.5-vpred represents a major advancement in Stable Diffusion XL (SD XL) modeling, notably supporting resolutions from 256 up to 2048. The v3.5-vpred variant particularly emphasizes robust natural language understanding, comparable in sophistication to miniaturized large language models (LLMs), achieved through extensive simultaneous training of both the CLIP and UNet components.
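As a rough illustration of working within the 256 to 2048 range quoted above, here is a minimal sketch of a helper that keeps a requested size inside that range. The function name `snap_resolution` and the constants are hypothetical, and applying the bounds to each side independently is an assumption of this sketch, not something the release notes state. The multiple-of-8 rounding reflects the 8x downsampling of the SD XL VAE.

```python
# Bounds taken from the "256 up to 2048" range quoted in the text; applying
# them per side is an assumption of this sketch.
MIN_SIDE = 256
MAX_SIDE = 2048
LATENT_FACTOR = 8  # the SD XL VAE downsamples each side by a factor of 8


def snap_resolution(width: int, height: int) -> tuple[int, int]:
    """Clamp each side to [MIN_SIDE, MAX_SIDE], then round down to a multiple of 8."""

    def snap(side: int) -> int:
        clamped = max(MIN_SIDE, min(MAX_SIDE, side))
        return clamped - (clamped % LATENT_FACTOR)

    return snap(width), snap(height)


if __name__ == "__main__":
    print(snap_resolution(1024, 1536))
    print(snap_resolution(100, 1541))
```

The result could then be passed as the width and height of whatever generation call you use. Rounding down rather than up keeps the clamped value from exceeding the upper bound.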
Introduction
The Illustrious XL 1.0-2.0 series aims to stabilize native generation at 1536 resolution while significantly improving natural language understanding capabilities.
While users sometimes observed successful 1024x1536 resolution generations, these were not stable. Similarly, 512x512 generations occasionally produced unwanted artifacts.