One of the main events of recent times is, of course, the release of Dall-E 2.
Dall-E 2 is a new model that generates images from a text description (and not only that; more on this in the posts below), and the quality of its results is simply gorgeous 🔥
Let's try to figure out what it is, how it works, and gawk at the results =)
1. The path from Dall-E to Dall-E 2
If you remember, Dall-E is a model for generating images from a text description. OpenAI presented the first version of Dall-E at the very beginning of 2021. The model was based on a VAE and a Transformer and could very roughly be described as "GPT-3 for generating pictures." You can read more about the first Dall-E on the OpenAI blog and in the paper on arXiv.
Around the same time, the same OpenAI released CLIP, a multimodal model that works with images and text and maps both into one shared embedding space. Such a model can be used to solve many image-and-text tasks in zero-shot mode (i.e. without additional training). For example, you can make CLIP classify images simply by supplying the class names as natural-language text. I wrote more about this and about CLIP in the post above.
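To make the zero-shot trick concrete, here is a minimal numpy sketch of the mechanism. The embeddings are random stand-ins (real ones would come from CLIP's image and text encoders); the point is only how classification reduces to cosine similarity plus a softmax:

```python
# Toy sketch of CLIP-style zero-shot classification (hypothetical embeddings).
# A real setup would run CLIP's encoders; here random vectors illustrate
# the mechanism: cosine similarity + softmax over class prompts.
import numpy as np

def normalize(v):
    """Project vectors onto the unit sphere (CLIP compares embeddings by cosine)."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def zero_shot_classify(image_emb, class_text_embs, temperature=0.07):
    """Pick the class whose text embedding is closest to the image embedding."""
    sims = normalize(class_text_embs) @ normalize(image_emb)  # cosine similarities
    logits = sims / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs

rng = np.random.default_rng(0)
dim = 512
class_names = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text_embs = rng.normal(size=(3, dim))
# Fake an image embedding that lies close to the "dog" prompt:
image_emb = text_embs[0] + 0.1 * rng.normal(size=dim)

probs = zero_shot_classify(image_emb, text_embs)
print(class_names[int(np.argmax(probs))])  # -> a photo of a dog
```

No gradient steps anywhere: the "classifier" is just the list of class prompts, which is exactly why it works zero-shot.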
Also, diffusion models have recently been gaining popularity in image generation. As of the fall of 2021, the Palette diffusion model held SOTA on several image generation tasks at once. We wrote more about this in the post above.
Soon after that, closer to the end of 2021, OpenAI released the GLIDE model. Like Dall-E, it is a model for generating images from a text description, but it works somewhat better. GLIDE is based on diffusion models. Thanks to this, GLIDE can not only generate pictures better than its predecessor, but also solve related tasks, such as filling in parts of a picture (image inpainting). Read about GLIDE in this post.
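For intuition about what "diffusion model" means here: training gradually corrupts an image with Gaussian noise according to a schedule, and the network learns to undo that corruption step by step. A toy numpy sketch of the closed-form forward (noising) process, on a fake 1-D "image" (the schedule values are the common linear choice, not GLIDE's exact ones):

```python
# Minimal sketch of the diffusion *forward* process on a toy 1-D "image".
# Closed form: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
# A generative diffusion model learns to reverse this noising, step by step.
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # common linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)    # cumulative fraction of signal kept

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) in a single shot."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

rng = np.random.default_rng(0)
x0 = np.ones(64)                        # toy "image": constant signal
early = q_sample(x0, 10, rng)           # keeps almost all of the signal
late = q_sample(x0, 999, rng)           # almost pure Gaussian noise
print(early.mean().round(2), late.mean().round(2))
```

The reverse (generation) direction is where the learned network comes in, predicting the noise to subtract at each step; that part is omitted here.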
And finally, April 2022, the present day. OpenAI releases Dall-E 2. The model is built on CLIP plus diffusion models. As you can see, this is not quite a "second version of Dall-E," since the ideas behind the two models are completely different =) However, in terms of image generation quality Dall-E 2 is simply gorgeous, and by that measure you really can call it "the next, second generation of generative models."
2. The architecture of Dall-E 2
Dall-E 2 combines the ideas of CLIP and diffusion models.
The principle behind Dall-E 2 is very simple: use CLIP to get an embedding of the text, and then generate an image with a diffusion model conditioned on this embedding. CLIP is trained first; the diffusion part is trained on top of its embeddings.
An illustration of the model's architecture is in the picture attached to the post. The upper part (above the dashed line) illustrates CLIP training; the lower part illustrates the training of the diffusion model.
The principle of operation in more detail:
The training dataset consists of pairs {x, y} = {image, its text description}.
- pass the image x through CLIP and get an embedding t;
- pass the text y through CLIP and get an embedding z (blue in the picture);
- train CLIP to align the embeddings t and z;
- the text embedding z is fed into the prior, which generates an image embedding t from it;
- the image embedding t is fed into the diffusion model, which generates the picture.
More details on how it all works are in the paper from OpenAI.
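The CLIP training step (teaching the model to align t and z) is a symmetric contrastive loss over a batch: matching image-text pairs sit on the diagonal of the similarity matrix. A numpy sketch with random stand-in embeddings:

```python
# Sketch of CLIP's symmetric contrastive objective on a batch of
# (image embedding, text embedding) pairs; matching pairs are the diagonal.
# Embeddings are random stand-ins for the outputs of CLIP's two encoders.
import numpy as np

def clip_loss(img_embs, txt_embs, temperature=0.07):
    """Softmax cross-entropy in both directions over cosine similarities."""
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature            # (batch, batch) similarity matrix
    n = logits.shape[0]

    def xent(lg):                                 # cross-entropy, labels = diagonal
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
batch, dim = 8, 512
txt = rng.normal(size=(batch, dim))
aligned = txt + 0.05 * rng.normal(size=(batch, dim))  # images matching their texts
shuffled = np.roll(aligned, 1, axis=0)                # broken pairing

print(clip_loss(aligned, txt) < clip_loss(shuffled, txt))  # -> True
```

Minimizing this loss is exactly what pulls the embeddings t and z of a matching pair together while pushing non-matching pairs apart.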
3. The results of Dall-E 2
Since Dall-E 2 is built on the ideas of CLIP and diffusion, the model can not only generate pictures from a text description, but also perform related tasks. For example:
- fill in missing parts of an image;
- edit details of an image, add new details to a picture;
- cross several images (including generating animations with a smooth transition between them).
(^ the links above lead to posts with examples of Dall-E 2 on these tasks)
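The "crossing images with a smooth transition" trick is commonly done by interpolating between the two images' embeddings and decoding each intermediate point. On normalized embeddings, spherical interpolation (slerp) is the usual choice. A sketch with random stand-in embeddings:

```python
# Sketch of "crossing" two images by spherically interpolating (slerp)
# between their embeddings, as is commonly done for smooth transitions.
# The endpoints are random stand-ins for real CLIP image embeddings.
import numpy as np

def slerp(a, b, t):
    """Spherical interpolation between vectors a and b, normalized to unit length."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    omega = np.arccos(np.clip(a @ b, -1.0, 1.0))   # angle between endpoints
    if omega < 1e-6:                               # nearly parallel: lerp is fine
        return (1 - t) * a + t * b
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

rng = np.random.default_rng(0)
e1, e2 = rng.normal(size=512), rng.normal(size=512)

# A smooth path of embeddings; decoding each one with the diffusion decoder
# would yield an animation morphing one image into the other.
path = [slerp(e1, e2, t) for t in np.linspace(0, 1, 8)]
print(len(path), np.linalg.norm(path[3]).round(3))  # 8 points, all unit-norm
```

Unlike plain linear interpolation, slerp keeps every intermediate embedding on the unit sphere, which matches the norm the decoder was trained to expect.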
And here are a few more examples from Dall-E 2 that just give you goosebumps. Just look how good the generation is!
- hamster loaders
- Demoterenization of the iPhone
- an artbook of 100 robot images generated by Dall-E 2
- animal helicopters
- dogs with a pitchfork and pizza (it seems to be a cosplay of some famous painting))
#machinelearning #neuralnetworks #computerscience #codinglife #reactjs #technology #frontend #development #programmers #deeplearning