As AI increasingly dominates the narrative in technology and business, most people’s understanding of it remains limited to tools like ChatGPT. However, one rapidly advancing area is AI image generation. You may be familiar with some tools in this space, but I aim to examine how different image generation models respond to the same prompt.
First, let’s briefly explore how AI image generation works and the mechanical differences between AI text and image generation.
How do image generation models work? Models like DALL-E are trained on vast datasets of images, often paired with text descriptions. During training, the model is fed millions of image-text pairs, learning associations between words and visual concepts. When given a text prompt, it generates a corresponding image by synthesizing pixels that align with the patterns and visual relationships in its training data. Essentially, the AI acts like a painter, laying down ‘brush strokes’ guided by what it learned from those image-text pairs rather than by looking up a stored database of images. This process can introduce bias, which we will explore further in this article.
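To make this concrete, here is a minimal sketch of how a text-to-image model is typically invoked in code, using Stable Diffusion through the Hugging Face diffusers library. The model name and settings are illustrative assumptions; this is not the pipeline behind DALL-E or the other tools tested below.

```python
# Minimal text-to-image sketch using Hugging Face diffusers.
# Model name and parameters are illustrative, not the tools tested in this article.
import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained text-to-image diffusion model.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")  # use "cpu" (and float32) if no GPU is available

# The prompt is encoded and guides the denoising steps that synthesize pixels
# consistent with patterns the model learned from image-text pairs.
prompt = "An image of 4 friends drinking wine in Napa, CA on a sunny day"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("napa_friends.png")
```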
How do text generation models work? In contrast, text-based AI models such as GPT-4 are trained on extensive text data, learning language patterns, grammar, and context. When prompted, they generate text one token at a time, predicting the most likely next word or phrase from your input and everything they have generated so far, essentially ‘guessing’ the best next words.
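Here is the text side in the same spirit: a minimal sketch of next-token prediction using the small, openly available GPT-2 model. GPT-4 itself is not publicly downloadable, so GPT-2 stands in purely for illustration, and the prompt is my own example.

```python
# Minimal next-token prediction sketch with a small open model (GPT-2).
# GPT-2 is a stand-in for illustration; GPT-4's weights are not public.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# The model repeatedly predicts the most likely next token given the prompt
# plus everything it has generated so far (greedy decoding here).
prompt = "Four friends spent a sunny afternoon tasting wine in Napa, and"
result = generator(prompt, max_new_tokens=30, do_sample=False)
print(result[0]["generated_text"])
```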
The key difference is that an image model must do more than predict the next word: it has to interpret your words and then render the concept you describe as a coherent visual.
Testing Image Generation with the Same Prompt
One pitfall of image generation is that limited training data can lead to divergent or biased outputs. As a Bay Area-based contributor, I picked a familiar local scene and tested the same prompt across four different image generators: “An image of 4 friends drinking wine in Napa, CA on a sunny day.”
For this test, I used four generators: DALL-E, Imagen, Midjourney, and Adobe Firefly.
As anyone familiar with these tools knows, they generate multiple images per prompt, so I restricted the test to the first image each model produced. For DALL-E and Imagen, I accessed the models through Canva, which offers separate apps for both. Here were the results:
The outputs tended to converge on similar imagery. Midjourney showed the most divergence among the four results, followed by Firefly, while the DALL-E and Imagen outputs were, at least anecdotally, quite similar to each other.
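For anyone who wants to reproduce the DALL-E portion of this test programmatically rather than through Canva, a minimal sketch using OpenAI's Python SDK might look like the following. The model name and image size are assumptions on my part; the results above came from Canva's apps, not from this code.

```python
# Sketch of requesting a single DALL-E image via OpenAI's Python SDK,
# mirroring the "first image only" constraint used in this test.
# Model name and size are assumptions; the article's results came from Canva.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

response = client.images.generate(
    model="dall-e-3",
    prompt="An image of 4 friends drinking wine in Napa, CA on a sunny day",
    n=1,                # take only the first image returned
    size="1024x1024",
)
print(response.data[0].url)  # URL of the generated image
```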
While image generation technology is advancing rapidly, it raises concerns about bias and other potential issues. As training data expands, these models will improve. However, with video generation nearing mainstream adoption through companies like Runway and Pika, extra caution is necessary when relying on text-to-image and text-to-video outputs to avoid reinforcing societal biases.