Making summaries locally - Part #1: Videos

Following my tests of different online chatbots for content summarization, I found it interesting to explain what is possible using only local resources (a local LLM or MLLM).
You may be uncomfortable submitting personal data or confidential files to private external servers, prefer not to upload content to a server under a different jurisdiction, or simply find it more convenient because summarization is an integral component of a larger, locally managed workflow.
You are likely already aware that the models consume RAM or VRAM, and your capacity for local summarization is a direct function of your machine's resources. For clarity, I am currently using a machine with 24 GB of RAM and 32 GB of VRAM. While not small, this is not a high-end specification; significantly better results would be achieved with, for example, 128 GB of RAM. Typically, you must employ a Large Language Model (LLM) that accepts file inputs; in my experience, these LLMs are sufficiently fast when running exclusively on RAM.
My tutorials on using chatbots for summarization have explored two distinct areas: primarily text summarization, but also video summarization, since some models possess multimodal capabilities (MLLM).
I have already published a tutorial on installing a tool capable of video summarization (MiniCPM-V 4.5), along with some test scripts for it, so I'll start with that.
Since the video summarization part is already quite long, I have decided to split local video summarization and local text summarization into separate posts.
The second part should come soon...
This page https://ollama.com/openbmb/minicpm-v4.5 indicates that the context length of MiniCPM-V 4.5 is 40k tokens.
I wasn't able to find a better reference.
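To get a feel for what a 40k-token context allows, here is a back-of-the-envelope calculation. The tokens-per-frame cost and the text budget below are assumptions for illustration only; the real figures depend on the model's visual compression and packing settings.

```python
# Rough frame-budget arithmetic for a 40k-token context window.
CONTEXT_LENGTH = 40_000           # from the Ollama model page
TOKENS_PER_FRAME = 64             # ASSUMED per-frame cost, for illustration
PROMPT_AND_ANSWER_BUDGET = 4_000  # ASSUMED tokens reserved for text

max_budget_frames = (CONTEXT_LENGTH - PROMPT_AND_ANSWER_BUDGET) // TOKENS_PER_FRAME
print(max_budget_frames)  # frames that would fit under these assumptions
```

The point is not the exact number but the shape of the trade-off: a fixed context means that frames per second, frames per call, and the text budget all compete for the same token pool.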
We will utilize these tools to generate a summary of the same video referenced in the previous tutorial.
Please refer to the setup guide at (link) to install the required environment; the tutorial should remain current.
Then:
- Activate the necessary environment.
- Download the test script here and place it in the same directory as the "LICENSE" file of "MiniCPM-V".
- Navigate to this directory within your activated environment: [Path/Directory]
- Download the video https://www.youtube.com/watch?v=2dsHuU10udY to this directory.
(I used yt-dlp)
I renamed the video "vid.webm" (yt-dlp gave me a webm file), then converted it to mp4, because the script doesn't handle webm files:
ffmpeg -i vid.webm vid.mp4
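If you do this often, the conversion step can be wrapped in a small helper. This is a hypothetical convenience function, not part of the provided script; it assumes ffmpeg is on your PATH and simply mirrors the manual command above.

```python
import subprocess
from pathlib import Path

def to_mp4(src: str) -> str:
    """Convert a video to mp4 with ffmpeg unless it already is one.

    Hypothetical helper mirroring the manual step in the text;
    assumes ffmpeg is available on PATH.
    """
    src_path = Path(src)
    if src_path.suffix.lower() == ".mp4":
        return src  # already mp4, nothing to do
    dst = str(src_path.with_suffix(".mp4"))
    # Equivalent to the manual command: ffmpeg -i vid.webm vid.mp4
    subprocess.run(["ffmpeg", "-i", src, dst], check=True)
    return dst
```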
The script I provide has two particularities:
- It uses a model located in the directory "MiniCPM-V-4_5", which should be in the same directory as the script (not in your usual Hugging Face cache).
Again, refer to the installation procedure that I link here.
- If enabled, the script can also process the video's audio track (the default model is "whisper-large", but this can easily be edited).
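Because the weights live next to the script rather than in the usual Hugging Face cache, a missing or misplaced directory is the most likely setup failure. A guard like the one below makes that explicit; it is a hypothetical sketch, and the actual script may resolve the path differently.

```python
from pathlib import Path

def resolve_model_dir(script_path: str, name: str = "MiniCPM-V-4_5") -> Path:
    """Return the model directory expected next to the script.

    Hypothetical helper: the key point is that the weights live
    beside the script, not in the Hugging Face cache.
    """
    model_dir = Path(script_path).resolve().parent / name
    if not model_dir.is_dir():
        raise FileNotFoundError(
            f"Expected model weights in {model_dir}; see the installation tutorial."
        )
    return model_dir
```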
I recommend running
python video_chat4_seed.py --help
to see the script's options. The most interesting are:
--choose_fps: the number of frames per second to analyze.
If you don't have enough memory, or if you want the run to finish faster, reduce this value; however, you may miss information contained in the skipped frames (loss of information).
--max_frames: limits the number of frames the video model receives in a single packing group.
Reduce it if you don't have enough memory. If this value is too small, you may split information that belongs in the same packing group (loss of context).
--max_packing: limits the number of packing groups the frames can be split into.
--use_audio: enables audio processing.
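My mental model of how the three video options interact is sketched below. This is my reading of the flags, not the script's actual code: sample roughly duration × fps frames, split them into groups of at most max_frames, and cap the number of groups at max_packing.

```python
import math

def packing_plan(duration_s: float, choose_fps: float,
                 max_frames: int, max_packing: int):
    """Sketch of how the options MIGHT combine (assumption, not the
    script's exact logic)."""
    sampled = int(duration_s * choose_fps)          # frames to analyze
    groups = math.ceil(sampled / max_frames)        # packing groups needed
    if groups > max_packing:
        groups = max_packing                        # cap hit: context is lost
    return sampled, groups

print(packing_plan(120, 1, 32, 3))  # a 2-minute video at 1 fps → (120, 3)
```

Under these assumptions, lowering choose_fps reduces memory at the cost of skipped frames, while lowering max_frames increases the number of groups until max_packing truncates them.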
Let's start the script with audio enabled:
python video_chat4_seed.py --use_audio
Once the script is loaded, the video is added to the context:
/load_video (path to the video)
So in this case: /load_video vid.mp4
You: what is this video about ?
Answer:
Analyzing video with current settings... (this may take a moment)
Transcribing audio...
(+ warnings, the answer may take more time than with an online chatbot)
The video is about the game "Rock, Paper, Scissors".
That's a bit short, so let's try:
You: Make a summary of the video
[Using seed: 1763602231]
Model: The video is a guide on how to play the game "Rock, Paper, Scissors."
It starts with an introduction by a person seated at a table who explains the objective of the two-player game:
beating the symbol thrown by the opponent in a best-of-three format.
The three symbols are introduced as rock (a closed fist), paper (a flat hand palm down), and scissors
(a fist with two fingers extended).
The video then demonstrates each gesture clearly against a black background:
- A clenched fist representing Rock
- An open hand showing Paper
- Two fingers forming the shape of scissors
Next, it shows both hands together demonstrating the gestures simultaneously.
The explanation returns briefly before transitioning to text on a black background that reads, "'Rock, Paper, Scissors, Shoot!'"
This indicates when players should announce their choice during the game.
Following this, multiple pairs of hands demonstrate different combinations of the gestures
(rock vs. paper, paper vs. scissors, etc.), accompanied by corresponding labels such as
"Rock," "Paper," "Scissors," and "Shoot!" These visuals help viewers understand what each combination
means within the game context.
Finally, there's another demonstration where one pair of hands forms rocks while the other holds up guns
("Shoot!"), leading to a declaration of a winner marked by green text stating, "Winner!"
***
A perfectly acceptable answer...
While I am generally satisfied with the model's answers, which make it a perfectly capable solution, I wasn't able to find settings that return the total number of frames of the video, or the point in time at which something happens in it. The max_frames parameter seems to influence this kind of answer, but I didn't find any set of rules that would allow retrieving it reliably.
I asked, and it seems that this feature is not implemented in the current version of MiniCPM.