neural net - csanyk.com

DALL-E mini is a scaled down implementation of DALL-E, a neural net AI that can create novel images based on natural language instructions. I’ve been having fun trying it out.

I like pugs, so I have told DALL-E mini to make me a lot of pug pictures of various types. A lot of the results look at least halfway decent.

The twitter account @weirddalle posts a lot of “good” (amusing) DALL-E results. Most of which I think are safe to say are novel creations, not something you would expect to find many (if any) examples from a google image search (although, who knows, the internet is a really big, really weird space).

And then I asked DALL-E mini for “a great big pug” and the mystique unraveled for me. I could recognize a lot of familiar pug photos from Google’s image search results page. I tried to go to google to find them, but the current results it gives me are a bit different; I’d guess that the images in the screen cap of DALL-E, below, would have been the top hits for “pug” in Google Image search several years ago.

The four in the upper right corner look especially familiar to me, as though I’ve seen those four images in that arrangement many times before (as I believe I have from searching Google for images of pugs many times.) I feel very confident that I have seen images very close to all nine of them before. Of course now, in 2022, if I search google for images of pugs, I get different results. But I’ve been a frequent searcher of pug pictures for 20 years, and I’m pretty confident that perhaps 5 or 10 years ago, most of the above 9 images were first-page results in Google Images for “pug”.

So, this makes me wonder, is DALL-E merely a sophisticated plagiarist? Is it simply taking images from Google and running some filters on them? For a very generic, simple query, it seems like the answer might be “maybe.”

DALL-E’s source code is available on github, which should make answering this question somewhat easy for someone who has expertise in the programming language and in AI. But I probably don’t have much hope of understanding it myself if I try to read it. I can program a little bit, sure, but I have no experience in writing neural net code.

My guess is that DALL-E does some natural language parsing to guess at the intent of the query, tokenizes the full query to break it up into parts that it can use to search for images, very likely using Google Image search. Then it (randomly? algorithmically?) selects some of those images, and does some kind of edge detection to break down the image’s composition into recognizable objects. We’ve been training AI to do image recognition by solving captchas for a while, although most of that seems to be to help create an AI that can drive a car. But DALL-E has to recognize whatever elements form the Google Image Search results match the token in the query string. Once it does so, it “steals” that element out of the original Google image, and combines it with other recognizable objects from other images from other parts of the tokenized query string, compositing them algorithmically into a new composition, and then as a final touch it may apply one of several photoshop filters on the image to give it the appearance of a photograph, painting, drawing, or what have you.

These results can be impressive, or they can be a total failure. Often they’re close to being “good”, suggesting a composition for an actually-good image that would satisfy the original query, but only if you don’t look at it too closely, because if you do, you just see a blurred mess. But perhaps if DALL-E were given more resources, more data, or more time, it might make these images cleaner and better than it does.

(Mind you, I’m not saying using the set of data that is Google Image Search results is the cheating part. Obviously, the AI needs to have data about the world in order to apply its neural net logic to. But there’s a difference between analyzing billions of images and using that analysis to come up with rules for creating new images based on a natural language text query, and simply selecting an image result that matches the query and applying a filter to it to disguise it as a new image, and then call it a day.)

So, when you give DALL-E very little to work with, such as a single keyword, does it just give you an entire, recognizable image from Google Image search results, with a filter applied to it.

Is it all just smoke-and-mirrors?

I guess all of AI is, to a greater or lesser degree, “just smoke and mirrors” — depending on how well you understand smoke and mirrors. But the question I’m trying to ask really is “just how simple or sophisticated are these smoke and mirrors?” If it is easy to deduce the AI’s “reasoning method” then maybe it’s too simple, and we might regard it as “phony” AI. But if, on the other hand, we can be fooled (“convinced”) that the AI did something that requires “real intelligence” to accomplish, then it is “sophisticated”.

I really enjoy playing around with DALL-E mini and seeing what it can do. It is delightful when it gives good results to a query.

For example: