Let's talk a little about text–image datasets: collections of pictures paired with text descriptions. Why do we need them? For example, to train models such as DALL-E (generates an image from a text description) or CLIP (computes the similarity between a text and an image by projecting text strings and pictures into the same high-dimensional space). Google Image Search itself may well use something like CLIP to rank images against a user's text query.
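To make the "shared space" idea more concrete, here is a minimal sketch of computing text–image similarity with the public CLIP weights via the Hugging Face transformers library; the image path and candidate captions are just placeholders.

```python
# Minimal sketch: text–image similarity with public CLIP weights via
# Hugging Face transformers. "cat.jpg" and the captions are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")
texts = ["a photo of a cat", "a photo of a dog", "a screenshot of code"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_image holds the (scaled) cosine similarity between the image
# and every candidate caption; softmax turns it into a ranking.
probs = out.logits_per_image.softmax(dim=-1)[0]
for t, p in zip(texts, probs.tolist()):
    print(f"{p:.3f}  {t}")
```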
To train a text-to-image model, or a model like CLIP, you of course need a labeled dataset prepared in advance. For example, OpenAI trained CLIP on 400M text–image pairs, but never released the dataset. At least the model weights were published on GitHub. DALL-E was trained in a similar fashion: they selected a subset of 250M text–image pairs and also showed the dataset to no one. One suspects they are worried about copyright.
The recently hyped ruDALL-E (I wrote about it here) from Sber was trained on more than 120M pairs, which is getting close to OpenAI's 250M. Sber used all kinds of public datasets. First they took Conceptual Captions, YFCC100M, Russian Wikipedia, and ImageNet. Then they added the OpenImages, LAION-400M, WIT, Web2M, and HowTo datasets, plus a pinch of their own web crawling, as I understand it. All of this was filtered to reduce noise in the data, and all English descriptions were translated into Russian. Unfortunately, this dataset was not released either. But they did publish the trained model, unlike OpenAI.
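Just to illustrate the kind of preprocessing described above, here is a toy sketch (not Sber's actual pipeline) of what a crude caption filter plus English-to-Russian translation step might look like, using the public Helsinki-NLP/opus-mt-en-ru model as an assumption:

```python
# Toy sketch (NOT Sber's actual pipeline): crude caption filtering plus
# English->Russian translation with the public Helsinki-NLP/opus-mt-en-ru model.
from transformers import pipeline

translate_en_ru = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ru")

def keep(caption: str) -> bool:
    # noise heuristics: reasonable length, no links, mostly letters
    if not (3 <= len(caption.split()) <= 64):
        return False
    if "http" in caption.lower():
        return False
    letters = sum(c.isalpha() for c in caption)
    return letters / max(len(caption), 1) > 0.5

pairs = [
    ("cat.jpg", "a small kitten sleeping on a warm laptop keyboard"),
    ("spam.jpg", "BUY NOW!!! https://example.com $$$"),
]

clean = [(img, translate_en_ru(cap)[0]["translation_text"])
         for img, cap in pairs if keep(cap)]
print(clean)  # only the first pair survives, with a Russian caption
```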
I'll also tell you about the new RedCaps dataset from Justin Johnson (you should know this name, he is a cool young professor) and his students at the University of Michigan. They tried to find a new source of free "text–image" annotations on the web that contains less garbage than crawling the entire Internet head-on. The idea was to pull image posts from Reddit and use the post title as the picture's caption and the subreddit name as an extra annotation. Subreddit names are usually self-descriptive and already carry a lot of information, for example "r/cats", "r/hiking", "r/FoodPorn". After selecting a subset of 350 subreddits and a bit of filtering, they ended up with 12M image–text pairs covering 13 years of Reddit history.
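For the curious, here is a toy sketch of that recipe (not the authors' actual collection code): pull image posts from a few subreddits via the Reddit API (PRAW) and keep the subreddit name plus the post title as the caption. The credentials and the subreddit list are placeholders.

```python
# Toy sketch of the RedCaps recipe (not the authors' code): subreddit name +
# post title becomes the caption. Credentials and subreddits are placeholders.
import praw

reddit = praw.Reddit(client_id="YOUR_ID", client_secret="YOUR_SECRET",
                     user_agent="redcaps-toy-script")

IMAGE_EXT = (".jpg", ".jpeg", ".png")
pairs = []
for sub in ["cats", "hiking", "FoodPorn"]:
    for post in reddit.subreddit(sub).top(time_filter="year", limit=100):
        # keep only direct image links and skip NSFW posts
        if post.url.lower().endswith(IMAGE_EXT) and not post.over_18:
            pairs.append({"subreddit": sub,
                          "caption": post.title,
                          "image_url": post.url})

print(len(pairs), pairs[:2])
```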
This dataset is a good example of what gets done to reduce the risk of receiving a court summons after publishing such a huge amount of data scraped from forum users. All photos containing people were detected with RetinaNet and deliberately discarded, and the dataset itself was published as a list of links with anonymized text descriptions (the images themselves are not stored on the site). In addition, the dataset's site has a form where anyone who objects can request the removal of any link.
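Since the dataset is distributed as links rather than images, using it means downloading the pictures yourself. Below is a minimal sketch of what that looks like; the annotation file name and field names are my assumptions, not the official RedCaps schema, so check the dataset docs for the real format.

```python
# Minimal sketch of consuming a "list of links" release: download the images
# yourself. The annotation file name and field names below are assumptions,
# not the official RedCaps schema.
import json
import pathlib
import requests

annotations = json.load(open("redcaps_annotations.json"))  # hypothetical file
out_dir = pathlib.Path("images")
out_dir.mkdir(exist_ok=True)

for i, ann in enumerate(annotations[:100]):
    url, caption = ann["image_url"], ann["caption"]  # assumed field names
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        (out_dir / f"{i:06d}.jpg").write_bytes(resp.content)
    except requests.RequestException:
        continue  # links rot over time; broken ones are simply skipped
```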
The result is a fairly high-quality dataset that outperforms existing public datasets (except LAION-400M, which they forgot to compare against) for training a model to generate descriptions from a picture. As an experiment, an image captioning model was trained on the different datasets and tested on how well the learned representations transfer to new datasets, both in zero-shot scenarios and via logistic regression on top of the features (linear probing). But, of course, the model trained by the Michigan researchers is significantly inferior to CLIP, for a number of reasons: CLIP has a slightly different architecture (it does not generate text, it learns embeddings), it is bigger and heavier, and it was trained for 40 times more iterations on a giant dataset that is 35 times larger than RedCaps.
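If "linear probing" sounds mysterious: you freeze a pretrained image encoder, extract features, and fit a plain logistic regression on top. Here is a small sketch of that setup, using a torchvision ResNet-50 and CIFAR-10 purely as stand-ins, not the exact models or datasets from the paper.

```python
# Sketch of linear probing: freeze a pretrained image encoder, extract
# features, fit a logistic regression on top. ResNet-50 and CIFAR-10 are
# stand-ins here, not the exact models/datasets from the paper.
import numpy as np
import torch
import torchvision
from sklearn.linear_model import LogisticRegression
from torchvision import datasets, transforms

encoder = torchvision.models.resnet50(weights="IMAGENET1K_V2")
encoder.fc = torch.nn.Identity()  # drop the classification head
encoder.eval()

tfm = transforms.Compose([
    transforms.Resize(224), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
train_set = datasets.CIFAR10("data", train=True, download=True, transform=tfm)
test_set = datasets.CIFAR10("data", train=False, download=True, transform=tfm)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=64)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=64)

def extract(loader):
    # run the frozen encoder over a loader and collect features + labels
    feats, labels = [], []
    with torch.no_grad():
        for x, y in loader:
            feats.append(encoder(x).numpy())
            labels.append(y.numpy())
    return np.concatenate(feats), np.concatenate(labels)

X_tr, y_tr = extract(train_loader)
X_te, y_te = extract(test_loader)

probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("linear-probe accuracy:", probe.score(X_te, y_te))
```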
You can see examples from RedCaps on the dataset's website.
#machinelearning #artificialintelligence #ai #datascience #programming #technology #deeplearning