This article, based on my own experiments with Wan 2.2, explains why you should optimize your settings and your nodes in ComfyUI, rather than just relying on third-party workflows...
In the end, I created my own optimized workflow, linked here, to do "first frame, last frame" video generation with Wan 2.2.
If you are interested in the workflow, this link will take you to the related CivitAI page. Note that I still plan to tune it a bit.
Click on "read more" if you want to know how I did it...
I wrote this article because, while I was desperately looking for an efficient workflow (I tried more than 10) to make "first frame, last frame" video generations (FLF2V) in ComfyUI with Wan 2.2 (Wan 2.2 being a txt2vid/img2vid generation tool that is much more efficient than AnimateDiff), I realized how much time I was losing with what I found.
Before I created my own workflow for ComfyUI, the only FLF2V workflow I found on CivitAI that somewhat worked on my computer was created by user GFrost (this page can contain NSFW content, so please check your settings on the CivitAI website first). On my computer, this workflow worked, but it was very slow. At least, it was much slower than this other img2vid (image to video) workflow, which was working flawlessly.
On my Windows installation of ComfyUI, I was able to generate 50 frames in about 3 minutes (img2vid), which was excellent. Since user GFrost claims to use a workflow that relies on SageAttention, a tool that is supposed to speed up generation, I started a new installation of ComfyUI on an Ubuntu Linux system with SageAttention 2 and Python 3.12, as recommended on this page. I wanted something much faster.
A comment about SageAttention 2: for my Windows system, I found some very convenient WHL files on this page that were compatible with my setup, so you don't actually need Python 3.12 to install SageAttention. Click here if you need more information.
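If you are not sure the wheel ended up in the Python environment that actually launches ComfyUI, a quick check from that interpreter can help. This is just a sanity-check sketch, assuming the wheel installs a package importable as "sageattention":

```python
# Minimal sanity check, assuming the SageAttention wheel installs a package
# importable as "sageattention". Run it with the same Python interpreter
# that launches ComfyUI.
try:
    import sageattention
    print("SageAttention found, version:", getattr(sageattention, "__version__", "unknown"))
except ImportError:
    print("SageAttention is not installed in this environment")
```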
Please note that at this point, I had no idea whether "first frame, last frame" video generation is inherently easier or harder than img2vid generation. It could simply be slower by nature: slower than img2vid generation...
After this new installation, I was indeed able to make Wan 2.2 generations on my Linux system with SageAttention 2, but the FLF2V workflow didn't need much to make my ComfyUI crash.
I first noticed that the captioning tool I used to replace "Florence Caption" (which I didn't want to install) was taking a lot of memory, and that I didn't really need it during generation. Disabling this captioning tool, which can simply be used in a separate workflow, saved 7 GB of RAM or VRAM, and that's a lot.
My generation ran better with those 7 GB freed, but I was still far from the speed of the other img2vid workflow. This made me realize that the lost time is cumulative: if I lose 2 minutes per generation and need 30 generations to tune my prompt (Wan 2.2 doesn't necessarily know how to do what I want it to do), that is an extra hour to reach the same level of quality. So I understood that optimizing the settings should be the first thing an efficient person does!
Testing the sampling methods:
I noticed that the img2vid workflow used the "Euler" method; since I didn't know which sampling method to use, I decided to test several of them.
Here are my quality criteria for my output video:
- generates visually appealing content (most important)
- generates content that respects the prompt
- consistency: no jump from one thing to something completely different, with the same generation settings as the others
- speed, because I noticed that there was a difference between samplers (this wasn't obvious to me)
Here are some selected, usable results for one random generation (end at step 4, total steps 10):
- Euler ancestral: 115 sec / 33 frames
- LCM: 115 sec (good)
- Euler ancestral cfg pp: 183 sec (different movements)
- DPM_2: 188 sec (not bad)
- DPM_2_ancestral: 188 sec (decent)
- DPMPP_SDE_gpu: 209 sec (OK; I don't know if it's normal that the "_gpu" version is slower, but that's what I observed)
- DPMPP_SDE: 195 sec (the best at respecting the prompt; that doesn't mean the others don't, and it could be luck, but overall I think it holds)
- ER_SDE: 117 sec (a bit crazy)
- sa_solver: 114 sec (a bit crazy)
- ddim: 114 sec (why not)
- deis_2m_ode: 175 sec (why not)
- res_2s_ode: 202 sec (original)
- res_5s: 451 sec (slow and not very interesting, but possible)
The scheduler was the "simple" one in all cases.
(I am currently testing sampler/scheduler combinations exhaustively; I'll publish a new article about what I find.)
Some of these samplers are only available on my Linux system; I don't know why...
All generations were done under similar conditions.
(The times are relative and probably only valid for my system and for this specific generation.)
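For convenience, here are the same timings as data, sorted from fastest to slowest (nothing new, just a recap of the list above):

```python
# The measured times from the list above (in seconds, for this one test
# generation on my system), sorted from fastest to slowest.
timings = {
    "Euler ancestral": 115, "LCM": 115, "Euler ancestral cfg pp": 183,
    "DPM_2": 188, "DPM_2_ancestral": 188, "DPMPP_SDE_gpu": 209,
    "DPMPP_SDE": 195, "ER_SDE": 117, "sa_solver": 114, "ddim": 114,
    "deis_2m_ode": 175, "res_2s_ode": 202, "res_5s": 451,
}
for sampler, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{sampler:>24}: {seconds} s")
```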
I kept DPMPP_SDE as the sampling method simply because I liked its results much better, even though an Euler-based method is faster.
That's my choice because one DPMPP_SDE generation was, on average, better than two Euler generations (and at 195 sec versus 2 × 115 sec, it is also cheaper per usable result), but that's my perspective...
Testing the number of steps:
Of course, using more steps (intuitively) takes longer, but it also changes the result you get.
My criteria for finding the best settings are much the same as above (respects the prompt, keeps consistency (no cut), is faster).
Here are my choices for my FLF2V workflow (I will keep the explanations short):
End at step: 2, Total steps: 14
End at step: 3, Total steps: 10-12
End at step: 4, Total steps: 10-12
Lower High CFG: the video is more creative.
Higher High CFG: the video tends to take shortcuts: there is often a cut from the first frames to the last frame, and they are not necessarily related.
Lower number of total steps: gives the impression that a ghost is present and transforms things.
Higher number of total steps: improves quality, but is slower.
In the end, my workflow was still slower than the img2vid one (from memory!) and still tended to make ComfyUI crash...
Short version of the further improvements that I made:
- "Clean VRAM" didn't help at all when the memory was close to saturation: "clean VRAM" was the cause of some crash.
Clean VRAM: removed
- Use of GGUF quantized files when it is possible: They are smaller in file size, but a bit less accurate.
I am not sure of the interest of a quantized clip loader, but if you use one, use an accurate one.
They could write text better but I didn't do any extensive analysis about that.
- Added SageAttention 2: it wasn't actually in the workflow! Now it is indeed much faster!
- Reconsidered the use of the resampling part: it was a RAM-consuming (not VRAM) process.
Worse: when I started to upload my videos to CivitAI, I noticed that the produced files were huge, around 500 MB when a normal video was only 5 MB, and these videos then needed to be loaded into memory to play inside ComfyUI.
Since these videos consume a lot of RAM, I now use the resampling step with caution.
- Encoding format of the videos: the video/h264-mp4 format uses the RAM.
The video_nvenc_h264-mp4 format should use CUDA instead.
On paper it is faster, but with MY workflow I didn't notice any significant performance gain (perhaps because it only merges about 100 frames?).
Then I considered that Wan 2.2, unlike Wan 2.1, uses 2 models: a high-noise one and a low-noise one.
This makes it possible to run each group of steps at a different moment, with fewer things loaded in memory.
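To make the idea concrete, here is a small sketch (plain Python, not ComfyUI node code) of how such a two-stage run typically splits the step schedule, assuming the usual setup where the high-noise model handles the early steps and the low-noise model finishes the rest:

```python
# Rough illustration (not actual ComfyUI node code) of how a two-model Wan 2.2
# run splits the denoising schedule: the high-noise model handles the early
# steps, the low-noise model finishes the remaining ones.
def split_schedule(total_steps: int, switch_step: int):
    """Return the step indices handled by each model.

    switch_step corresponds to the "End at step" value in the workflow:
    the high-noise pass runs steps [0, switch_step), the low-noise pass
    runs steps [switch_step, total_steps).
    """
    high_noise_steps = list(range(0, switch_step))
    low_noise_steps = list(range(switch_step, total_steps))
    return high_noise_steps, low_noise_steps

# Example with the default settings mentioned above (end at step 4, 10 total):
high, low = split_schedule(total_steps=10, switch_step=4)
print("high-noise model:", high)   # steps 0-3
print("low-noise model:", low)     # steps 4-9
```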
From a theoretical perspective, keeping the models in fast memory (VRAM instead of RAM) could save time, but then I would have to waste time loading the models into memory over and over (when doing one generation after another).
Still: if ComfyUI could crash and I prevent it, the risk profile of the generation is better.
But this only makes sense if I load the models from an SSD drive...
SSD drives are so much faster than HDDs that I decided to make sure I only use SSDs (or the fastest type of drive available) for my models anyway.
From here, I decided to build a workflow that uses a cache file and separates the high-noise and low-noise parts, and I decided to upgrade, for good, the way I work with these big models:
In 2025, I store them on an SSD drive!
Make sure to open or create the "extra_model_paths.yaml" file in your ComfyUI main directory (an example file, extra_model_paths.yaml.example, is provided there).
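For reference, here is a minimal sketch of the kind of entry I mean, pointing ComfyUI to models stored on an SSD. The exact keys and layout should follow the extra_model_paths.yaml.example file that ships with ComfyUI; the top-level name and the paths below are just placeholders:

```yaml
# Minimal sketch: point ComfyUI to model folders stored on a fast SSD.
# Follow the extra_model_paths.yaml.example shipped with ComfyUI for the
# full list of supported keys; the paths here are placeholders.
my_ssd_models:
    base_path: /mnt/fast_ssd/comfyui_models/
    checkpoints: checkpoints/
    diffusion_models: diffusion_models/
    clip: clip/
    vae: vae/
    loras: loras/
```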
Then I went back to my Windows installation, where there was no SageAttention 2:
There, I was able to run the img2vid workflow with the 2 quantized Q6 models together in a single run, several times in a row.
A similar configuration still made ComfyUI crash on Linux if I ran the img2vid workflow twice with the Q6 models.
This means that my Windows 10 system handled this better than my Linux one. My assessment is that SageAttention needs RAM too, despite being faster.
I was considering making some kind of benchmark:
- with SageAttention, reloading the models every time
- without SageAttention, but keeping the models in memory...
However, even with SageAttention disabled on my Linux system, it still tends to crash more than on Windows 10. (That's what I observe: it can't handle several runs with the 2 Q6 GGUF files together on my Linux system.)
On Windows, ComfyUI tends to crash less (or not at all), but the price is usually that the whole system can become next to unusable instead of crashing, so I am not sure it is always better; in my case, it is.
So I now have a workflow that separates the "high noise" and "low noise" parts, but if I reload the models every time, I waste time.
My next solution was to run all the first (high-noise) parts in a row to get several cache files. The cache files are then used later to run the second (low-noise) parts. Still, you need to remember what you did, or you could make mistakes.
This means remembering the settings for when you run the second part. The workflow that uses several cache files is more tedious to use, but unless you have a graphics card with more than about 40 GB of VRAM (ideally I would need even more than an RTX 5090 offers), the result is probably faster on your card, as long as you stay rigorous with your workflow.
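Since mixing up cache files and settings is the main risk with this two-pass approach, it helps to encode the settings in the cache file name itself. The helper below is purely hypothetical (it is not part of the uploaded workflow), just to illustrate the idea:

```python
# Hypothetical naming helper (not part of the uploaded workflow): encode the
# settings of the high-noise pass in the cache file name, so the low-noise
# pass can be run later with matching parameters.
def cache_filename(shot_name: str, sampler: str, switch_step: int,
                   total_steps: int, frames: int, seed: int) -> str:
    return (f"{shot_name}_{sampler}_s{switch_step}of{total_steps}"
            f"_f{frames}_seed{seed}.latent")

# Example: "shot012_dpmpp_sde_s4of10_f81_seed123456.latent"
print(cache_filename("shot012", "dpmpp_sde", 4, 10, 81, 123456))
```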
I uploaded a workflow to a CivitAI page to do this FLF2V generation. However, you should do your own tests and tune the workflows that you are going to use a lot. Just because a workflow is provided somewhere (even in the ComfyUI workflow menu or on ComfyUI's blog) doesn't mean it is what you should use if you aim to work efficiently.
This tutorial is not exhaustive on many points.
Further improvements:
The img2vid workflow uses only 2 / 4 steps (end at step 2, total steps 4).
The default settings of my FLF2V workflow are 4 / 10 steps.
At this point, I haven't tried to reduce the number of steps of my FLF2V workflow any further. This would mean copying more of the img2vid workflow and adding "NAG" nodes and temporal attention to cut the number of steps.
I still need to test the effect of the nodes used in the img2vid workflow, to see whether they are compatible with my needs for FLF2V. I didn't rush, because the result is not necessarily faster or better.
I tried this 4-step Lightning LoRA and I wasn't convinced.
Especially in retrospect, now that I have a generation speed similar to the img2vid workflow but with more steps. Also, having the prompt rendered accurately and consistently matters more than raw speed, because if it isn't, you have to start over more often.
Next step (soon!): a screening of the best sampler/scheduler combinations.