Artificial intelligence keeps refining the realm of fakes. Fake text was largely mastered a couple of years ago by GPT-3, the natural language processing program from startup OpenAI.
Images had already achieved substantial fakery thanks to programs such as StyleGAN, introduced by Tero Karras and colleagues at Nvidia in 2019. They got a boost this summer when OpenAI announced a new program for faking images, DALL•E 2, which builds upon the first DALL•E, released in January of 2021. It can take a phrase you type and convert it into an image, with lots of ways to shape the output image.
This week, OpenAI removed the wait list; anyone can now go to the site to take DALL•E 2 for a spin as long as they are willing to create an account on OpenAI’s website with an email address and phone number.
DALL•E 2’s forte, like its predecessor’s, is to create images from text that a person types into a field on the webpage. Type the phrase “an astronaut riding a horse in a photorealistic style,” and an image will appear in roughly that form: a realistic rendering of a figure in profile in an astronaut uniform, astride a horse striding against what seems like an image of the cosmos.
The work is described in a research paper by OpenAI scientists Aditya Ramesh and colleagues, “Hierarchical Text-Conditional Image Generation with CLIP Latents,” posted on the arXiv pre-print server.
DALL•E 2 is what is known as a contrastive encoder-decoder. It is built by compressing images and their captions into a kind of abstract, combined representation, and then decompressing them. That training develops the program’s ability to associate text and image.
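To make the "contrastive" idea concrete, here is a minimal sketch of a symmetric contrastive loss over paired caption and image embeddings, in the style used by CLIP-like models. It is an illustration only, not OpenAI's implementation: the function names, the two-dimensional toy vectors, and the temperature value of 0.07 are all assumptions chosen for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cross_entropy_row(logits, target):
    """Negative log-softmax of logits[target], computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def contrastive_loss(text_embs, image_embs, temperature=0.07):
    """Symmetric contrastive loss: caption k should match image k.

    temperature is a hypothetical value picked for this sketch.
    """
    n = len(text_embs)
    # Similarity of every caption with every image, scaled by temperature.
    sims = [[cosine(t, i) / temperature for i in image_embs] for t in text_embs]
    # Each caption must pick out its own image among all images...
    text_to_image = sum(cross_entropy_row(sims[k], k) for k in range(n)) / n
    # ...and each image must pick out its own caption among all captions.
    image_to_text = sum(
        cross_entropy_row([sims[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return (text_to_image + image_to_text) / 2

# Toy embeddings (made up): two captions and two images.
captions = [[1.0, 0.0], [0.0, 1.0]]
matched = [[0.9, 0.1], [0.1, 0.9]]   # image k lies close to caption k
swapped = [matched[1], matched[0]]   # pairs deliberately mismatched

loss_matched = contrastive_loss(captions, matched)
loss_swapped = contrastive_loss(captions, swapped)
```

In this toy setup the loss is small when each caption sits near its own image and large when the pairs are swapped, which is the pressure that teaches a model to associate text and image.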
Ramesh and colleagues’ main point is that the way the compression/decompression happens allows one to do more than simply translate between text and image. It allows one to use phrases to shape aspects of an image, such as adding the term “photorealistic,” which produces something with a certain slick realism.
While the images are still somewhat rough, you can see that DALL•E 2 has the potential to replace a lot of commercial illustration and even stock photography. By typing a phrase, and a style, such as “photo,” you can output a variety of images that may be suitable to illustrate articles.
You can see for yourself by trying it out. Most of the things that leap immediately to mind are funny combos. For example, “A blue whale and a kitten making friends on a beach, digital art” produces the endearing greeting-card style output below.
Four versions are offered at a time, and you can download each of them in PNG format.
But it’s also possible to get a number of more banal images that fit a stock photography context. Typing the phrase “A ZDNET contributing writer seeing the future of technology in their own articles by a mountainside hovering in space” produces a kind of sci-fi image that is close to what could accompany an article.
One can add the phrase “realistic picture” and get something a little more slick.
Using the phrase “Photo of very anxious computer user staring at their computer monitor and seeing a Windows patch alert” produced a delightful array of images of typically fearful computer users.
The phrase can be amplified with additional words to get more specific results, such as “Photo of very anxious computer user at their desk staring at their computer monitor and seeing a Windows patch alert.”
Once you start dwelling on stock photography, you’ll find you can come up with lots of scenarios to turn into an image. For example, “Photo of a person with glasses making a point to several people at a conference table in a meeting room” yields a pretty good selection of what look at first blush like real office scenes.
Again, one can get more specific, changing attributes of the scene with a few words, such as “Photo of a person with glasses standing next to a blackboard in a conference room explaining something to their coworkers.”
As you can see, things such as facial features are generally degraded in the DALL•E 2 output.
By applying terms of artists or artistic media or style, one can shift the same image from the realm of stock photography to the realm of illustration, as in the phrase, “Francis Bacon painting of a group of people in a conference room and one person with glasses standing next to a blackboard explaining something.”
Once you create an account, OpenAI gives you 50 “credits.” These are free requests to the system, where each phrase entered counts as one request. Once you’ve used up the 50, you can either wait a month and get the next 15 free credits, or you can buy credits. Credits are sold in packs of 115 for $15, or about 13 cents per credit.
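The pricing arithmetic is easy to check. The sketch below uses only the figures quoted in this article (50 free sign-up credits, packs of 115 credits for $15) and assumes that one prompt costs one credit and that credits can be bought only in whole packs. The function names are invented for illustration.

```python
PACK_PRICE_USD = 15.00   # price of one credit pack, as quoted in the article
PACK_CREDITS = 115       # credits per pack, as quoted in the article
FREE_SIGNUP = 50         # free credits granted on sign-up

def cost_per_credit():
    """Price of a single paid credit in dollars."""
    return PACK_PRICE_USD / PACK_CREDITS

def packs_needed(prompts, free_credits=FREE_SIGNUP):
    """Whole packs required to cover prompts beyond the free credits."""
    paid = max(0, prompts - free_credits)
    return -(-paid // PACK_CREDITS)  # ceiling division

def total_cost(prompts, free_credits=FREE_SIGNUP):
    """Dollars spent to run the given number of prompts."""
    return packs_needed(prompts, free_credits) * PACK_PRICE_USD

print(f"{cost_per_credit() * 100:.1f} cents per credit")
print(f"200 prompts cost ${total_cost(200):.2f}")
```

Under these assumptions, the first 50 prompts are free, and a user who wants 200 prompts in the first month needs 150 paid credits, which means buying two packs.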
It is possible to stump the program. Some requests may be too much of a blend of real and imagined to render convincingly. For example, a request for “rats with blue fur taking over Times Square” produces a decent first attempt, but the fur element gives the picture a sloppy, uneven quality that doesn’t really work.
Other requests may trip up DALL•E 2 because of a choice of a single word.
The request “a bag of money sitting on a lawn chair on a porch overlooking the sunset” generated completely bizarre, unrelated images, such as a close up of toenails, and an ambiguous image that seemed to be some flowers stuck inside a rug.
Substituting the word “placed” for “sitting” allowed DALL•E 2 to produce a satisfactory result in one out of three images.
It may be that the program cannot find a suitable combination of elements for what appears to be an active verb, “sitting,” when combined with an inanimate object, a bag.
In general, the program seems to struggle with aspects of place, such as “standing in front of an easel.”
Phrases that are not descriptions but questions or interjections seem to boot the system into a random mode. For example, “does DALL•E 2 know its own name?” is an expression that produces several images of flowers. That might be a poetic response, but it feels more like a rejection of the prompt.
There are some guardrails put in by OpenAI, spelled out in the posted content policy, and they are used to automatically zap any verboten attempts. For example, typing “Microsoft cofounder Bill Gates smoking a cigar in a dumpy apartment with broken down furniture” will not generate an image. Instead, an error message shows up stating that the request violates the policy and directing you to the policy page. Probably, this is a case of violating the rule “Do not create images of public figures.”
The same request, substituting rather less-well-known public figure Tiernan Ray, a ZDNET contributing writer, generated a selection of amusing images of people who are not Tiernan Ray.
What’s more, trademarked names seem to be protected from wholesale reproduction. The phrase “a bunch of people hanging out in front of McDonald’s” produces a suitable-enough scene, but every result offered slightly alters “McDonald’s” so that it is not actually that word.
Where do things go next? Work on the basic approach of text-to-image is proceeding on numerous fronts. One is adding more lexical complexity to the program. For example, Chitwan Saharia and team at Google Brain in May published their work on “Imagen,” a program they say has an “unprecedented degree of photorealism.” The trick was to use a far greater corpus of language materials to train the network.
And there is work being done to broaden the complexity of the kinds of things that a program can make. For example, Google scientists Wenhu Chen and colleagues this month created a program that extends Saharia and team’s Imagen, called “Re-Imagen,” which combines the basic idea of compressing text and image together with a third element, search results.
By adding what they call “retrieval,” the program is designed not just to find a “semantic” combination of word and image but also to seek out combinations in internet search results that will fine-tune the output. They claim the results are far superior to those of Imagen and DALL•E 2 in handling rare, obscure phrases such as “Picarones is served with wine,” referring to the Peruvian sweet potato dessert.