Build a picture search engine using CLIP.
It is probably very easy to make an image search engine using CLIP:
just ask your favorite LLM (ChatGPT, Gemini, Grok, Claude...) to make one... 'et voilà!'.
Alternatively, I could just ask an AI to write the code for me and then simply comment it, as if we were still in the old days when we had to code by hand.
I don't think either of these ideas is interesting by modern standards.
Still, I decided to write an article about CLIP models because they use an interesting but straightforward concept.
CLIP pairs a vision encoder, which extracts visual features, with a text encoder, which extracts semantic meaning.
Why should you use a CLIP model to build an image search engine instead of a VLM?
A VLM could certainly be used to build an image search engine; perhaps it is the future.
After all, a VLM can answer questions more accurately than CLIP models and describe images better than BLIP models.
So, why not just enumerate images and ask a question to a VLM?
The answer is that CLIP models are significantly faster than VLMs.
If you provide a query to a CLIP model that has already embedded (or vectorized) your images, it can retrieve the image that best matches that query in seconds, even among one million other images.
You will not be able to achieve that speed with a VLM; with a VLM, processing the same number of images would take all day.
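To make this concrete, here is a minimal sketch of the fast path once the images are already embedded: the query becomes one text vector, and finding the best matches is a single matrix multiplication over the stored image vectors. The model name, pretrained tag, query, and cache file names below are illustrative assumptions, not necessarily what my script uses:

import torch
import open_clip

# Small CLIP model from open-clip-torch; any CLIP variant works the same way.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Hypothetical cache files: an (N, 512) tensor of L2-normalized image embeddings
# and the matching list of file paths, both saved during an earlier indexing run.
image_embeddings = torch.load("image_embeddings.pt")
image_paths = torch.load("image_paths.pt")

with torch.no_grad():
    tokens = tokenizer(["a dog wearing a hat"])
    query = model.encode_text(tokens)
    query = query / query.norm(dim=-1, keepdim=True)

scores = (image_embeddings @ query.T).squeeze(1)   # one cosine similarity per image
top = scores.topk(5)
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{score:.3f}  {image_paths[idx]}")

The expensive part (encoding the images) happened earlier; at query time only one short text is encoded, which is why the search itself takes seconds.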
Beyond CLIP models that associate text and images, there is a more fundamental concept: associating media with text.
For example, there is CLAP (Contrastive Language-Audio Pretraining), which associates audio files with text.
This allows you to determine the most likely musical genre of an audio file.
To answer specific questions about books, you can vectorize chunks of text (instead of images like in CLIP) and retrieve the most relevant ones.
This is a core part of Retrieval-Augmented Generation (RAG), where an LLM rephrases retrieved chunks to answer a user.
However, for summarization, you must retrieve and synthesize multiple key chunks or use a hierarchical approach, as a single chunk rarely captures the entire narrative.
Alternatively, you could vectorize cake recipes for ingredient substitution, predict texture based on ingredient proportions, or identify ingredients by submitting a picture of the final cake...
All of this, with the same concept...
How are CLIP models trained?
The first models were trained on 400 million image-text pairs; the text was often the alt-text and captions from pictures scraped online.
The 'intelligence' is supposed to come from this massive amount of data.
However, this data source is noisy and relies on how people describe the pictures, as nothing prevents humans from mislabeling them.
Naturally, among these pictures, there are certainly images of celebrities.
Therefore, if you query a CLIP model with a celebrity's name, it can likely find them among a huge number of unrelated pictures—unless the data has been filtered, which is likely the case with Google's SigLIP.
In any case, it is important to know what kind of data has been used when selecting a CLIP model and how the model was trained.
Usually, a CLIP model tries to match text with the whole image. Just because you have a picture of 'Barack Obama playing golf in New Jersey,' it does not mean 'Barack Obama' is going to be the best match for the picture; that is just a part of the image.
Also, CLIP models have been trained at a fixed resolution; very often it is 224x224 (I suggest checking the Wikipedia page for these models to find the specific resolutions used). A CLIP model that uses a 384-pixel resolution is likely to be more sensitive to details.
This means that just because your eye is drawn to a small object that stands out against a bland background, it does not follow that a CLIP model will notice it too. CLIP models pay attention to the whole image and, at their working resolution, capture fewer details than the original high-resolution picture. One procedure to make CLIP detect smaller objects is tiling: the picture is cut into sub-parts (tiles), and CLIP analyzes each part individually. Honestly, when I tested this, it didn't work very well, but conceptually, I know it should work.
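For what it's worth, the tiling idea fits in a few lines: crop the picture into tiles, embed each tile, and keep the best tile score for the query. The tile size, model choice, and example file name below are placeholder assumptions, not what my script does:

import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

def best_tile_score(image_path, query, tile=336):
    # Cut the image into tile x tile crops (the last row/column may be smaller).
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    crops = [img.crop((x, y, min(x + tile, w), min(y + tile, h)))
             for y in range(0, h, tile) for x in range(0, w, tile)]
    with torch.no_grad():
        tiles = torch.stack([preprocess(c) for c in crops])   # each crop resized to the model's input size
        img_emb = model.encode_image(tiles)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        txt_emb = model.encode_text(tokenizer([query]))
        txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).max().item()   # keep the best-matching tile

print(best_tile_score("street_scene.jpg", "a red fire hydrant"))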
CLIP models are trained using a process called 'contrastive learning.' You have a text encoder that encodes the text and places the result on one axis of a matrix. A vision encoder encodes the image and places the result on the other axis.
The model is mathematically 'punished' if the similarity on the diagonal is low, and 'punished' if the similarity anywhere else is high. This creates a semantic agreement between text and pictures and allows 'zero-shot' retrieval. You don't need to train the model with a 'dog with a hat' to retrieve a picture of a dog with a hat. The model needs to know what a dog is and what a hat is, and it can manage from there.
The training images are divided into batches of a given size; this batch size determines the dimensions of the similarity matrix during training. Each encoder produces an output vector via a projection head, which acts like a bridge to a shared space. Its dimension is 512 for ViT-B/32, 768 for ViT-L/14, and 1280 for SigLIP (ViT-bigG). This number represents the 'width' of the mathematical signature of each image; while these dimensions aren't individual words, they capture the combined semantic traits that allow the model to associate pictures with language.
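In code, a stripped-down version of that contrastive objective looks like the sketch below. Real CLIP training uses a learned temperature and much larger batches; the fixed temperature, dimensions, and random inputs here are just for illustration:

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # image_emb and text_emb: (batch_size, dim) outputs of the projection heads.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature    # (batch, batch) similarity matrix
    targets = torch.arange(logits.shape[0])          # pair i matches pair i: the diagonal
    loss_images = F.cross_entropy(logits, targets)   # image -> text direction
    loss_texts = F.cross_entropy(logits.T, targets)  # text -> image direction
    return (loss_images + loss_texts) / 2

# Toy call with random 512-dimensional embeddings and a batch of 8 pairs.
print(clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)))

The cross-entropy pushes the diagonal (matching pairs) up and everything off the diagonal down, which is exactly the 'punishment' described above.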
Since CLIP models cannot generate text like BLIP models, if you want CLIP to associate attributes with images, it is useful to have an exhaustive wordlist. From this, you can retrieve the best keywords associated with an image.
Depending on what you are trying to do, you could use an LLM with these keywords to imagine a scene.
However, if you need to be accurate regarding the image content, I would use a VLM instead.
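As a rough sketch of the wordlist idea (the wordlist, prompt template, and file name below are placeholders; a real wordlist would be far longer):

import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

wordlist = ["beach", "forest", "dog", "hat", "sunset", "bicycle", "snow"]
prompts = [f"a photo of a {word}" for word in wordlist]

with torch.no_grad():
    image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(tokenizer(prompts))
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_emb @ txt_emb.T).softmax(dim=-1).squeeze(0)

# Print the three keywords that fit the image best.
for prob, word in sorted(zip(probs.tolist(), wordlist), reverse=True)[:3]:
    print(f"{word}: {prob:.2%}")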
From here, I am going to show you the code that I produced with Gemini and point out what I think is important.
First of all, it is important to understand that the most time-consuming part of working with CLIP models is not searching for the images that best match a concept, but rather embedding (vectorizing) the images themselves. That is why you need to implement a cache in your source code.
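The cache in my script is more elaborate than this, but the basic idea fits in a few lines: embed each image once, store the vector keyed by its path, and skip files already present on the next run. The file and folder names here are placeholders:

import os
import torch
import open_clip
from PIL import Image

CACHE_FILE = "clip_cache.pt"   # placeholder cache location

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k")
cache = torch.load(CACHE_FILE) if os.path.exists(CACHE_FILE) else {}

def embed_folder(folder):
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        if path in cache:
            continue                          # already embedded: skip the slow part
        try:
            image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        except Exception:
            continue                          # not a readable image: skip
        with torch.no_grad():
            emb = model.encode_image(image).squeeze(0)
        cache[path] = emb / emb.norm()        # store the normalized vector
    torch.save(cache, CACHE_FILE)

embed_folder("pictures")

A real indexing pass would embed images in batches rather than one by one, which is exactly the RAM/VRAM trade-off discussed next.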
Furthermore, the code must be adapted to your RAM and graphics card. If you can embed larger batches at once, you will likely save time by increasing throughput, provided you do not saturate your VRAM and cause the system to crash or slow down (that is a risk with my script).
My picture search engine script utilizes three different CLIP models. I personally prefer SigLIP as it feels the fastest; however, in my testing, it failed to recognize George W. Bush on a test picture, so I wouldn't recommend it for searching for specific people.
ViT-B/32 is the fastest overall and uses less memory, but it only employs 512-dimensional vectors and is less accurate than the others. ViT-H-14 is a more balanced choice for accuracy, though it is the slowest—which may not necessarily be a drawback depending on your needs.
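For reference, this is roughly how the three sizes are exposed in open-clip-torch; the architecture names and pretrained tags below are my assumptions and may differ from the exact checkpoints my script downloads:

import open_clip

# Architecture name and pretrained tag, as commonly listed by open-clip-torch.
candidates = {
    "fast, 512-dim": ("ViT-B-32", "laion2b_s34b_b79k"),
    "slower, more accurate": ("ViT-H-14", "laion2b_s32b_b79k"),
    "SigLIP": ("ViT-SO400M-14-SigLIP-384", "webli"),
}

for label, (arch, tag) in candidates.items():
    model, _, preprocess = open_clip.create_model_and_transforms(arch, pretrained=tag)
    tokenizer = open_clip.get_tokenizer(arch)
    print(f"{label}: {arch} loaded")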
CLIP and SigLIP differ because SigLIP uses a 'sigmoid approach'—the sigmoid being an S-shaped curve.
SigLIP answers a binary question for every image-text pair: 'Is this a match?'
In contrast, CLIP models use a 'softmax approach'; softmax is another mathematical function, one that turns a set of scores into probabilities that sum to 1.
A CLIP model looks at all captions in a batch simultaneously and asks: 'Which one of these is the best fit?'
While the results may look similar in a picture search engine, the two approaches are mathematically distinct.
If you estimate how well a picture matches a text using a score, you will notice that sigmoid scores tend to be lower than softmax scores. Softmax models may simply return the 'least wrong' answer because they are forced to pick a winner.
Sigmoid scores, on the other hand, are more like shades of yes or no.
(My code uses cosine similarity, a measure of the angle between the query vector and the image vector. That is why the score of the same picture does not change, even when placed with other pictures.)
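A toy illustration of the difference (the cosine values, scale, and bias below are made up for the example, not the trained values of any real checkpoint):

import torch

# Cosine similarities between one query and three candidate images.
cosine = torch.tensor([0.31, 0.12, 0.05])

# Softmax: scores are relative and must sum to 1, so one image always "wins",
# and every score changes if you add or remove candidates from the pool.
softmax_scores = (cosine * 100).softmax(dim=0)

# Sigmoid: each pair is judged on its own, so a score does not depend on
# which other pictures happen to be in the pool.
sigmoid_scores = torch.sigmoid(cosine * 10 - 5)

print("softmax:", [round(s, 3) for s in softmax_scores.tolist()])
print("sigmoid:", [round(s, 3) for s in sigmoid_scores.tolist()])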
Besides this, the script features some basic functions to move or copy the pictures elsewhere.
How to use the image search engine script?
You need an environment with pytorch, open-clip-torch, transformers, pillow, sentencepiece, protobuf, and psutil.
If you need to install one, click on one of the setup scripts (DON'T run them from a PowerShell CLI on Windows).
To activate the environment, use one of the "activate" scripts.
Then run: python clip_picture_search_engine.py --help
to learn more about the script's features.
Then, if you want to run the script with the default settings, simply use:
python clip_picture_search_engine.py
The script will ask you what it needs.