For a video version of this article, click here.
What do you see when you imagine an avocado armchair? What about a harp that looks like a snail? What about the famous HOLLYWOOD sign in LA replaced by the words HOW AM I WEIRD?
The first two images were generated by an AI model called DALL-E, released by OpenAI in January 2021 and named after the artist Salvador Dalí and the Pixar character WALL-E. The third image, HOW AM I WEIRD in the Hollywood hills, is something I imagined for our new card game HOW AM I WEIRD (take a look!).
DALL-E, like you, has a fantastic imagination — the word fantastic comes from the Greek phantastikos, from the verb phantazein, ‘to make visible’. DALL-E takes a text description and generates images to match it. The description can be as specific as ‘the exact same cat on the top as a sketch on the bottom’, or one that requires world knowledge, e.g. ‘a photo of a phone from the 1920s’. It understands shapes, colours, 3D perspective, artistic styles, and even some visual reasoning. Below are some examples of DALL-E-generated images:
DALL-E is not the only AI model that can imagine. A few years ago, Google released Deep Dream, which generates hallucinatory, dream-like images. However, unlike DALL-E, Deep Dream cannot connect these images with language.
So how did DALL-E learn to imagine from language? It learned by associating text with images: its training data are millions of captioned images collected from the internet. Technically, it is a Transformer language model with an architecture similar to GPT-3’s, except that each training example is a sequence of word tokens followed by image tokens (a compressed, discrete encoding of the picture) rather than words alone. During training, DALL-E reads the word ‘kitten’ alongside images of kittens, and ‘blue cupcake’ alongside images of blue cupcakes. After seeing millions of such caption–image pairs, it learns to visualise a huge range of (English) words as images.
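To make this concrete, here is a minimal sketch of how a caption and a picture become one token stream for ordinary next-token prediction. The vocabulary sizes and the 1024-token image grid are the figures reported for DALL-E; the toy token lists stand in for the real BPE and dVAE encoders, which are not reproduced here.

```python
# Sketch of DALL-E-style training data: image generation framed as
# language modelling over a single concatenated token sequence.

TEXT_VOCAB = 16384      # size of the BPE caption vocabulary
IMAGE_VOCAB = 8192      # discrete image codes from the image encoder
N_IMAGE_TOKENS = 1024   # a 32x32 grid of image tokens per picture

def make_training_sequence(text_tokens, image_tokens):
    """Concatenate caption and image tokens into one stream.

    Image token ids are offset by TEXT_VOCAB so both vocabularies share
    one embedding table; the transformer then simply predicts the next
    token - first the rest of the caption, then the picture.
    """
    assert len(image_tokens) == N_IMAGE_TOKENS
    sequence = list(text_tokens) + [t + TEXT_VOCAB for t in image_tokens]
    # Standard next-token prediction: shift by one position.
    inputs, targets = sequence[:-1], sequence[1:]
    return inputs, targets

# Toy example: a 3-token caption followed by a dummy image.
caption = [17, 256, 99]                            # e.g. "a blue kitten"
image = [i % IMAGE_VOCAB for i in range(N_IMAGE_TOKENS)]
inputs, targets = make_training_sequence(caption, image)
print(len(inputs))   # 1026, i.e. (3 + 1024) - 1
```

At generation time the process runs in reverse: the model is given only the caption tokens and samples image tokens one at a time, which a decoder then turns back into pixels.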
But you might wonder: how can DALL-E imagine things that humans have never described? DALL-E’s learning goes beyond simple memorisation — the model learns to extract concepts that can be combined and reshaped. Thanks to the depth of the neural network and the Transformer’s attention mechanism, different layers inside the network represent concepts at different levels, ranging from surface features to more abstract meanings. By seeing millions of images of kittens, blue cupcakes and other objects, it learns concepts such as ‘kitten’, ‘blue’ and ‘cupcake’, and can combine them in novel ways to imagine, for instance, ‘blue kittens’.
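As a toy illustration of why such distributed concepts compose (hand-picked vectors for clarity, not DALL-E’s actual learned representations): if ‘blue’ and ‘kitten’ are directions in a shared vector space, then ‘blue kitten’ is a perfectly well-defined point even though that combination never appeared in training.

```python
# Toy concept space: each concept is a direction; combinations are sums.
import math

concepts = {
    "blue":    [1.0, 0.0, 0.0, 0.0],
    "pink":    [0.0, 1.0, 0.0, 0.0],
    "kitten":  [0.0, 0.0, 1.0, 0.0],
    "cupcake": [0.0, 0.0, 0.0, 1.0],
}

def compose(*words):
    """Combine concepts by adding their vectors."""
    out = [0.0] * 4
    for w in words:
        out = [a + b for a, b in zip(out, concepts[w])]
    return out

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

seen = compose("blue", "cupcake")    # a combination from the training data
novel = compose("blue", "kitten")    # never seen, yet still well-defined

# The novel point sits nearer 'kitten' than 'cupcake', so a decoder can
# still tell what object it should draw:
print(cosine(novel, concepts["kitten"]) > cosine(novel, concepts["cupcake"]))  # True
```

The real model’s concept space is learned rather than hand-built, and composition happens through many attention layers rather than simple addition, but the underlying idea is the same: concepts that live in a shared space can be recombined freely.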
The ability to combine multiple concepts into a new image is how we humans created mythical hybrids between ourselves and other animals: centaurs, griffins, the Minotaur and so on. Try imagining a half-horse, half-human creature holding a sword. Your frontal cortex activates these concepts and instructs the memory and vision areas of the brain to create little snippets of ‘horse’, ‘human’, ‘sword’ and ‘holding’. These snippets then travel and meet to form a complete image.
For this synthesis to happen, the snippets must all meet at the same time (see detailed reviews here and here). However, because they originate in different brain areas, they have to travel at different speeds in order to arrive together. These speeds are controlled by myelin — the fatty tissue that wraps around neurons’ axons. Myelination is more active during childhood, which might explain why children are generally more creative than adults.
Imagination sets us apart from animals. Humans are storytellers — we create mythical creatures, gods, giants, whole worlds we have never seen, and we share these imagined worlds with each other. For example, Dalí imagined clocks like melted cheese in his mind’s eye, and we get to experience his imagination through his painting ‘The Persistence of Memory’. Now we have created AI that has overtaken animals in terms of imagination, and it will probably soon catch up with humans.
I leave you with this question: what is the purpose of imagination? Of course, creating stories is fun, but does it help us understand the real world? In my opinion, imagination — playing with reality — helps us understand reality more deeply. Look at Picasso’s Bull Study (1946). The bulls at the top left represent reality, while those at the bottom right are far more imaginative and abstract, yet we have no trouble recognising those few lines as a bull. By playing with reality, Picasso grasps the few important shapes and drops all the less important details, like sifting gold from sand. The perspective shift we gain from imagining lets us distinguish the essence from the surface, and new combinations let us reshape reality to better suit our needs.
In the same way, I suspect that by learning to map language to images, DALL-E learns language better than it would from reading language alone. To link words with pixels, DALL-E has to extract concepts at a level that is neither language nor pixels. This means DALL-E could arrive at a representation of meaning that is closer to the core and not tied to surface forms.
Whether this is true remains to be tested. But for now, enjoy your fantastic imagination! Stop being a grown-up for a while and let your inner child create weird and wonderful worlds.