As a way to experiment with recent generative AI tools, I challenged myself to design a piece of clothing for each color of the rainbow. The results are a sort of fashion line with a theme of bold, angular patterns.
I experimented with a variety of tools and approaches, but all of the images above were generated with free tools built on Stable Diffusion XL: either the macOS app Draw Things or the open-source project Fooocus. In a few cases, I also used Pixelmator Pro to fix small issues with faces, hands, and clothing using traditional photo-editing techniques.
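Both apps wrap the model in a GUI, but the same base model can also be driven directly from Python using Hugging Face's diffusers library. Here's a minimal sketch of that route; it isn't what Draw Things or Fooocus run internally, and the prompt is just a placeholder in the spirit of the project:

```python
# Minimal text-to-image sketch with Stable Diffusion XL using the
# diffusers library. Illustrative only: the prompt, step count, and
# guidance value are placeholders, not the settings behind the gallery.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)
pipe.to("cuda")  # use "mps" on Apple Silicon

prompt = "fashion photo of a model in a bold, angular red outfit"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.0).images[0]
image.save("red_outfit.png")
```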
Each image was selected from around 5 to 50 alternatives, each of which took between 1 and 6 minutes for the system to generate (depending on hardware and settings). So the gallery above represents at least 10 hours of total compute time.
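Taking the midpoints of those ranges lands close to that figure. A back-of-envelope sketch, assuming seven final images (one per rainbow color):

```python
# Rough sanity check of the "at least 10 hours" figure, assuming
# seven final images. Candidate counts and per-image minutes are the
# quoted ranges above, not exact measurements.
colors = 7
low  = colors * 5  * 1 / 60    # ~0.6 hours (best case)
mid  = colors * 25 * 3.5 / 60  # ~10.2 hours (midpoints)
high = colors * 50 * 6 / 60    # 35 hours (worst case)
print(f"{low:.1f} to {high:.1f} hours; midpoint estimate ~{mid:.1f}")
```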
In some cases, I needed to iterate repeatedly on the prompt text, adding or emphasizing terms to guide the system towards the balance of elements that I wanted. In other cases, I just needed to let the model produce more images (with the same prompt) before I found one that was close enough to my vision. In a few cases, I used a promising output image as the input for a second round of generation in order to more precisely specify the scene, outfit, and pose.
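That last trick is what most Stable Diffusion front ends call image-to-image. A hedged sketch of the idea using the diffusers library (again, not the apps' internals; the file names, prompt, and strength value are illustrative):

```python
# Image-to-image sketch with Stable Diffusion XL: start from a
# promising output and regenerate it under a refined prompt. The
# strength value controls how far the result may drift from the input
# (near 0 stays close, near 1 mostly ignores it).
import torch
from diffusers import StableDiffusionXLImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

init_image = load_image("promising_draft.png")  # hypothetical file
prompt = "same outfit, three-quarter pose, studio backdrop"
image = pipe(prompt, image=init_image, strength=0.45).images[0]
image.save("refined_draft.png")
```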
It’s impressive to see how realistic the images from these tools are getting, though they certainly have limits. If you specify too many details, some will be ignored, especially if they do not commonly co-occur. I also started to get a feel for the limits and biases of the training data, as evidenced by how much weight I needed to give different words before the generated images would actually start to reflect their meaning.
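The “weight” here is literal: Fooocus, like several other Stable Diffusion front ends, lets you attach a numeric emphasis to terms in the prompt, which I leaned on when the model kept glossing over a word. A hypothetical illustration of the syntax, not one of my actual prompts:

```python
# Prompt-weighting syntax as supported by Fooocus and similar front
# ends: "(term:1.5)" up-weights a term relative to the default of 1.0.
# These strings are hypothetical examples, not prompts from the gallery.
plain    = "model wearing an orange jacket with a geometric pattern"
weighted = "model wearing an (orange:1.5) jacket with a (geometric pattern:1.3)"
print(weighted)
```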
It’s also clear that the model does not have a deep understanding of physics or anatomy. AI-generated images famously struggle with hands, sometimes rendering too many fingers or fusing them together. The model also often failed to depict mechanical objects with realistic structure: I more or less gave up trying to generate bicycles, barbells, and drumsticks.
Overall, the experience of generating the fashion gallery felt less like automation and more like a new form of photography. Rather than having to buy gear, hire a model, sew an outfit, and travel to a location, you can describe all those things in words and virtually take the photo. But you still need the artistic vision to come up with a concept, as well as the editorial discretion to discard the vast majority of images — which is also the case in traditional photography.
Lastly, it was interesting to notice that the process of adjusting prompts and inspecting results was not so different from trying to communicate with another person. You’re never sure exactly how your words will be interpreted, and you sometimes need to iterate for a while to come to a shared understanding of “less like that” and “more like this”.