sebastiankamph

SVD Image2Video. Stable Video Diffusion subscriber guide.

Added 2023-12-02 21:17:10 +0000 UTC

In this guide we’ll focus on getting Stable Video Diffusion to work in ComfyUI. Once it is available in automatic1111, this guide might be updated, or a separate guide created. This guide is accompanied by the YouTube video found here: https://youtu.be/HOVYu2UbgEE

If you need help installing ComfyUI, I recommend this video: https://youtu.be/KTPLOqAMR0s

To run Stable Video Diffusion locally, you will have to download an SVD model. You only need to download ONE of these, but you can test both. The current options are:

Svd.safetensors (14 frames) <- Download

Svd_xt.safetensors (25 frames) <- Download

SVD: This model was trained to generate 14 frames at resolution 576x1024. It uses the standard image encoder from SD 2.1, but the decoder is replaced with a temporally-aware deflickering decoder.

SVD-XT: Same architecture as SVD but finetuned for 25 frame generation.

Download the models and place them in your models folder. The folder locations are:

For automatic1111 /models/Stable-diffusion/

For ComfyUI /models/checkpoints/

I recommend renaming svd to svd_14frames.safetensors so you remember that this model is intended for 14 frame length.

I recommend renaming svd_xt to svd_xt_25frames.safetensors so you remember that this model is intended for 25 frame length.
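If you would rather script the renaming than do it by hand, here is a minimal sketch. It assumes the downloaded files are named svd.safetensors and svd_xt.safetensors and already sit in your checkpoints folder; the function name rename_svd_models is made up for this example, and the folder path you pass in depends on your own install.

```python
from pathlib import Path

# Suggested names from this guide (old file name -> new file name).
RENAMES = {
    "svd.safetensors": "svd_14frames.safetensors",
    "svd_xt.safetensors": "svd_xt_25frames.safetensors",
}

def rename_svd_models(checkpoints_dir):
    """Rename downloaded SVD checkpoints; return the new names that were applied."""
    folder = Path(checkpoints_dir)
    renamed = []
    for old_name, new_name in RENAMES.items():
        source = folder / old_name
        target = folder / new_name
        # Skip files that are missing or that were already renamed.
        if source.is_file() and not target.exists():
            source.rename(target)
            renamed.append(new_name)
    return renamed
```

Call it with the path to your own checkpoints folder. Running it a second time does nothing, because the original file names no longer exist.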

Comfy has released a base workflow to get you started with SVD. I recommend testing this workflow first before you delve deeper.

Download here: https://comfyanonymous.github.io/ComfyUI_examples/video/workflow_image_to_video.json

Drag & drop this into your ComfyUI

Select your svd model in the checkpoint loader

Drag & drop an image into the Load Image box. For best results, use 1024x576 or another 16:9 image. This is the same aspect ratio as 1920x1080.
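If your source image is not 16:9, one option is to crop it before loading. The sketch below only does the arithmetic: it works out the largest centered 16:9 region of an image. The function name center_crop_box is made up for this example, and centered cropping is just one reasonable choice, not something the workflow requires.

```python
def center_crop_box(width, height, target_w=16, target_h=9):
    """Return (left, top, right, bottom) of the largest centered 16:9 crop."""
    # Compare aspect ratios with integer math to avoid float rounding.
    if width * target_h > height * target_w:
        # Image is too wide: keep the full height and trim the sides.
        new_w = height * target_w // target_h
        left = (width - new_w) // 2
        return (left, 0, left + new_w, height)
    # Image is too tall (or already 16:9): keep the full width, trim top and bottom.
    new_h = width * target_h // target_w
    top = (height - new_h) // 2
    return (0, top, width, top + new_h)

print(center_crop_box(1024, 1024))  # (0, 224, 1024, 800)
```

The returned box uses the (left, top, right, bottom) order that image libraries such as Pillow expect for cropping, so a square 1024x1024 image ends up as a 1024x576 strip from the middle.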

VideoLinearCFGGuidance: This node gradually changes the CFG through the video. The video starts at the CFG level set in this node and then changes towards the CFG set in the KSampler. This helps with the video quality. Leave default if unsure.
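To make the gradual change concrete, here is a small sketch of a linear CFG ramp across the frames. It illustrates the idea only and is not the node's actual code; the function name and the example values (1.0 rising to 2.5 over 4 frames) are assumptions picked for readability.

```python
def linear_cfg_schedule(min_cfg, end_cfg, video_frames):
    """Per-frame CFG values moving linearly from min_cfg to end_cfg."""
    if video_frames < 1:
        raise ValueError("video_frames must be at least 1")
    if video_frames == 1:
        return [float(min_cfg)]
    # Equal-sized step between consecutive frames.
    step = (end_cfg - min_cfg) / (video_frames - 1)
    return [min_cfg + step * i for i in range(video_frames)]

# First frame uses the node's CFG, last frame uses the KSampler's CFG.
print(linear_cfg_schedule(1.0, 2.5, 4))  # [1.0, 1.5, 2.0, 2.5]
```

Early frames stay close to the input image because of the low CFG, while later frames get stronger guidance.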

Width & height: The size of your output video. 1024x576 is the default size trained for this model, but other sizes can be used also.

video_frames: This sets how many frames, or images, are generated to make the video. For best results, use the same number your model was trained for, i.e. 14 or 25.

motion_bucket_id: Higher numbers mean more movement in the video.

fps: Frames per second - higher numbers make the video smoother. Good default values are 6, 12, 25.

augmentation_level: This scales how much your video will change from your starting image. It’s actually how much noise is added to the input image to give Stable Video Diffusion more to generate from. Higher value = more change. Lower value = less change. Leave this value low or your image will start breaking.
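The video_frames and fps settings above together decide how long the clip runs. Here is a quick sketch of that arithmetic, assuming the saved video plays back at the same fps you entered; the function name is made up for this example.

```python
def clip_duration_seconds(video_frames, fps):
    """Length of the clip in seconds when played back at the given fps."""
    if fps <= 0:
        raise ValueError("fps must be greater than 0")
    return video_frames / fps

# The 14-frame model at 6 fps, and the 25-frame model at 6 and 25 fps.
for frames, fps in [(14, 6), (25, 6), (25, 25)]:
    print(frames, "frames at", fps, "fps =", round(clip_duration_seconds(frames, fps), 2), "s")
```

So a higher fps gives smoother motion but a shorter clip, since the number of frames stays fixed by the model.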

In the KSampler node, the generation gets done.

Seed: Which seed, or starting noise, is used to generate. If you want to test settings you can set seed at fixed to see your changes. If you want a new generation each time, make sure it’s set to randomize.

Steps: The number of sampling steps. 20-50 is a good value here. I usually use 20-30.

Cfg: This works together with the VideoLinearCFGGuidance node described earlier. This is the ending CFG. Leave default if unsure.

Sampler_name: Your preferred sampler. You can test various, for example euler, euler_ancestral, or dpmpp_2m with the karras scheduler. The sampler is the tool that will create your images. Euler has been tested to work very well with SVD.

Denoise: In a similar way to augmentation level, this will change how much is changed from your initial image. Leave at default 1.
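If you like to double-check your settings before queueing, the suggestions in this guide can be written down as a small checker. This is a hypothetical helper, not part of ComfyUI: the function name and the dictionary keys are made up for this example, and the limits are simply the values recommended above.

```python
def check_settings(settings, model_frames):
    """Return a list of warnings for values outside this guide's suggestions."""
    warnings = []
    # video_frames should match what the chosen model was trained for (14 or 25).
    if settings["video_frames"] != model_frames:
        warnings.append(
            f"video_frames is {settings['video_frames']}, model was trained for {model_frames}"
        )
    # The guide suggests 20-50 sampling steps.
    if not 20 <= settings["steps"] <= 50:
        warnings.append("steps is outside the suggested 20-50 range")
    # The guide suggests leaving denoise at its default of 1.
    if settings["denoise"] != 1:
        warnings.append("denoise is not at the default of 1")
    return warnings

print(check_settings({"video_frames": 14, "steps": 25, "denoise": 1.0}, model_frames=14))  # []
```

An empty list means everything matches the recommendations. Anything else is only a hint, since the other values can still work.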

Your finished video will be saved as a webp. I recommend loading the VHS_VideoCombine node (requires installing a custom node from the manager), which will give you more format options.