What can you ask MiniCPM-V 4.5 about pictures?
In my latest tutorial, I provided a corrected installation method for MiniCPM-V 4.5 along with some test scripts, so it's time to use them and see what we can ask MiniCPM-V 4.5.
Today I will use the script for testing pictures.
Download the script above and put it in your "MiniCPM-V" directory, then activate your venv environment and run it from a command prompt opened in the "MiniCPM-V" directory. (If you are used to Automatic1111 or ComfyUI, you know what I mean, so I won't explain further.) I would also suggest that you first read the "MiniCPM Model License.md" file from the same directory (open it with Notepad).
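For example, on Windows it might look like this (the script name "chat_with_pictures.py" is a placeholder of my own; use the actual name of the script you downloaded):

    cd MiniCPM-V
    venv\Scripts\activate
    python chat_with_pictures.py

On Linux, the activation line would be "source venv/bin/activate" instead. The options described below can be appended to the last line.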
When you start the script, the main command-line options are:
--precision: important if you don't have a lot of VRAM.
--enable_thinking: if you want to test whether the "thinking mode" of the model gives you better results.
--seed: if you want to set the seed.
By default the seed is a UNIX timestamp, so if you give this seed to someone, it will indirectly disclose when you used the model. (This could raise privacy concerns, so consider yourself warned.)
The seed seems to matter, since some of the results depend on it.
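As an illustration of why a timestamp-based default seed leaks timing information, here is a minimal Python sketch of the pattern; the function name and structure are my own assumptions, not the actual code of the script:

    import time
    import torch

    def resolve_seed(seed_arg=None):
        # Hypothetical helper: if no --seed was passed, fall back to the
        # current UNIX timestamp, so sharing the seed reveals, to the
        # second, when the model was run.
        seed = seed_arg if seed_arg is not None else int(time.time())
        torch.manual_seed(seed)
        return seed

Passing an explicit value, for example --seed 42, avoids disclosing anything.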
For example, if I submit a series of pictures of celebrities and ask the model to recognize them, quite often the model will recognize more or less who it is, but the answer will be approximate. For example, if I feed it pictures of Adriana Lima, it recognizes her (from the same picture) about 40 to 50% of the time (this is not a measured 50%, just my perception), but very often it will name another Brazilian model like Alessandra Ambrosio, or even Kate Moss, who doesn't really look like Adriana Lima. So I would say that the VLM categorizes pictures well, while it is less reliable on purely facial features.
It is, by the way, funny that a model that can correctly recognize a (top) model with another seed INSISTS when you tell it that it is NOT Alessandra Ambrosio:
"Oh but Adriana Lima has blonde hair".
So if you are used to chat models that tell you "you are right", you will not find that behavior with MiniCPM-V 4.5.
The model also has a bias that makes it lean toward American stars more than it should, even when I feed it a local Indonesian or Thai celebrity (try with her).
Still, I find MiniCPM-V 4.5 quite capable of recognizing famous individuals.
--enable_thinking will change your results even with the same seed.
It makes the model work through deductions. I didn't test this mode much, but the few times I did, it was wrong more often than in normal mode; again, my perception is not a statistical study.
--image_path: this loads the picture at startup. I tend not to load pictures this way, and by the way I corrected a bug in this loading path. So if you use this option and you notice some oddity, just load the picture(s) in the chat instead.
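If you still want to try it, the call would look something like this (script name and path are placeholders):

    python chat_with_pictures.py --image_path C:\pictures\menu.jpg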
In the chat, the main command is "new_image". Please note that if you have tested my other script, the one that comments on videos, its commands are slightly different.
Also, the script has a small context memory, but the more questions you ask, the slower the answers will get.
The syntax to load ONE picture is:
new_image (local path or URL of the picture)
The syntax to load TWO pictures is:
new_image (local path or URL of the picture), (local path or URL of the picture)
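For example (the path and URL are placeholders):

    new_image C:\pictures\menu.jpg
    new_image C:\pictures\before.jpg, https://example.com/after.jpg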
In order to test the abilities of MiniCPM, I used pictures that you won't find on the internet, so the model wasn't trained on them.
- Image to text from a picture of typewritten text:
Reading the text of a menu: it used a ù instead of a u (the text is in Italian), but besides this, it was accurate.
In general, for typewritten text, the model tends to make small errors; it is good enough to be usable for some applications, but it is not perfect.
Once, it got stuck after 200 characters, but I pressed "Enter" while CMD was the active window to unstick it, and the whole text then appeared.
(The problem may lie in my own implementation.)
More queries on a menu:
"type the text of this picture, but only type the text that is in Italian."
It didn't do it very well.
"type the text of this picture, but don't type the English text."
Better, but some English text remained.
"type the text of this picture, translate it to French":
Decent; however, it doesn't translate some words very well. "Mirtle" is translated as "Mûre", while a correct translation would be "Myrte".
Again, the concept is right (an edible plant starting with M), but it is not exact.
That was an offline translation, but it translated the English part of the menu and not the Italian one.
MiniCPM is said to be multilingual; however, it is probably better in English.
"write the blue text only": not done correctly
"write the RGB code (in hexadecimal) of each color of the text": (gives an answer)
(Then) write only the text with the RGB code: #008000
Then the request is done correctly: so it doesn't call blue what I call blue (still #008000 is green not blue).
So this proves the interest of prompting: A model can't be trained exactly for everything and can't necessarily know what WE want, so one smart way to make the model do what we want is to ask it something that would lead to the same result. So to moderate the value of this rough test of MiniCPM, I could probably do better if I had asked differently.
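As a side note, decoding the hex triplet shows why #008000 is green and not blue; here is a quick Python check, unrelated to the script itself:

    # Split "008000" into its red, green and blue components.
    code = "008000"
    r, g, b = (int(code[i:i + 2], 16) for i in (0, 2, 4))
    print(r, g, b)  # 0 128 0 -> no red, mid-intensity green, no blue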
- Image to text from handwritten text:
Even if it can somewhat read handwritten text, the results can be a bit poor.
Still, it was able to read (decipher?) my handwriting on a simple note quite well.
But it couldn't read my shopping list very well (bilingual and very poorly written): 1 correct item out of 6.
(When I gave it the answer, it didn't insist that it was right.)
- Art:
Recognized a kinetic art sculpture correctly but didn't guess the exact artist; the names it gave were relevant, though.
When given a picture of a Lego version of the "Christ the Redeemer" statue, it recognized it accurately.
- Botany:
It is surprisingly good with decorative plants: out of 5 or 6 pictures, it made one small mistake.
"Describe a picture and the kind of trees": OK
Recognized a Peace Lily (Spathiphyllum) correctly.
Said that the picture of a plant was a "Delphinium elatum", but admitted that I was right when I said it was an aconite.
- Cat breeds:
Recognized a breed of cat correctly: 3 successes out of 3.
- Health:
I didn't have that many health-related pictures.
When given a picture of a red pimple over a brown mole, it assessed the red object in a meaningful way (I am not a physician); however, when I asked about the brown object, it failed to recognize the mole.
It performed OK on pictures from the internet (it recognized chloracne from a picture), but it could have been trained on those. It also made a reasonable guess about the kind of operation performed by a cosmetic surgeon.
Also, I don't have the medical knowledge to tell whether its medical advice is OK.
- Objects:
- Didn't recognize an old picture of a motorbike: it said it was a bicycle.
- Identified a graphics card better than I did: this time it was right to insist.
- Sightseeing:
Recognized a castle incorrectly: said it was Spanish when it is Portuguese.
Recognized a church incorrectly: guessed the style correctly, but said it was in Lisbon while it is somewhere else in Portugal.
Insisted that a famous building in Switzerland is located in the Netherlands (wrong!).
Insisted that Rastatt Castle is in fact Nymphenburg Palace (wrong!).
Again, it shows some accuracy for the style but not for the place.
Correctly recognized a Habsburg cannon and the period of the object.
Recognized Neuschwanstein Castle correctly.
- Coding:
Returns some Python code but doesn't code in a very meaningful way.
I didn't try the code because it didn't assess my needs correctly (I needed code for one picture; it provided code where two pictures were needed).
It wasn't able to provide very accurate information in a "find the differences" game.
This could be related to the way I asked the question; however, I didn't test the model enough to tell.
From all this, I can tell that this model performs better when an approximate answer or a broad concept is enough than when an accurate answer is needed. In botany, I found it capable, at least for domestic usage.
If you need an accurate answer and you only have MiniCPM: try different seeds to see whether its answers change or not (you need to restart the script).
Maybe you are just lucky or unlucky with the seed.
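For example, restart the script a few times with explicit seeds (the script name is again a placeholder) and ask the same question each time:

    python chat_with_pictures.py --seed 101
    python chat_with_pictures.py --seed 202
    python chat_with_pictures.py --seed 303

If the answers agree across seeds, you can trust them a bit more.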