[SANA-Video] Adding 5s pre-trained 480p SANA-Video inference #12584
Merged
+2,637
−7
39 commits
13e516c (lawrence-cj): 1. add `SanaVideoTransformer3DModel` in transformer_sana_video.py
5eb5354 (lawrence-cj): add a sample about how to use sana-video;
c6d7876 (lawrence-cj): code update;
d67ab2a (lawrence-cj): update hf model path;
a5f19e0 (lawrence-cj): update code;
c15ae23 (lawrence-cj): sana-video can run now;
ee79af3 (lawrence-cj): 1. add aspect ratio in sana-video-pipeline;
f06a93d (lawrence-cj): Merge branch 'main' into feat/sana-video
49557c1 (lawrence-cj): default to use `use_resolution_binning`;
857ca30 (lawrence-cj): make style;
3ed7000 (lawrence-cj): remove unused code;
439bf58 (lawrence-cj): Update src/diffusers/models/transformers/transformer_sana_video.py
de4cf31 (lawrence-cj): Update src/diffusers/models/transformers/transformer_sana_video.py
fe73287 (lawrence-cj): Update src/diffusers/models/transformers/transformer_sana_video.py
118677a (lawrence-cj): Update src/diffusers/pipelines/sana/pipeline_sana_video.py
3546c44 (lawrence-cj): Update src/diffusers/models/transformers/transformer_sana_video.py
f845bba (lawrence-cj): Update src/diffusers/models/transformers/transformer_sana_video.py
77714ba (lawrence-cj): Update src/diffusers/models/transformers/transformer_sana_video.py
b536cfd (lawrence-cj): Update src/diffusers/pipelines/sana/pipeline_sana_video.py
b0f4866 (lawrence-cj): Update src/diffusers/models/transformers/transformer_sana_video.py
7205eee (lawrence-cj): Update src/diffusers/pipelines/sana/pipeline_sana_video.py
e26a35d (lawrence-cj): Merge branch 'main' into feat/sana-video
fd5cff2 (lawrence-cj): support `dispatch_attention_fn`
f2a9d0b (lawrence-cj): 1. add sana-video markdown;
d98f93c (lawrence-cj): add two test case for sana-video (need check)
4569d0b (lawrence-cj): fix text-encoder in test-sana-video;
1379391 (lawrence-cj): Update tests/pipelines/sana/test_sana_video.py
b359240 (lawrence-cj): Update tests/pipelines/sana/test_sana_video.py
7256023 (lawrence-cj): Update tests/pipelines/sana/test_sana_video.py
25d1a4c (lawrence-cj): Update tests/pipelines/sana/test_sana_video.py
a9c16eb (lawrence-cj): Update tests/pipelines/sana/test_sana_video.py
8a27d58 (lawrence-cj): Update tests/pipelines/sana/test_sana_video.py
4c25427 (lawrence-cj): Update src/diffusers/pipelines/sana/pipeline_sana_video.py
31c9fa5 (lawrence-cj): Update src/diffusers/video_processor.py
0ed7eee (lawrence-cj): make style
e31f91b (lawrence-cj): toctree yaml update;
cb31fc2 (lawrence-cj): add sana-video-transformer3d markdown;
2b8c3e3 (lawrence-cj): Merge branch 'main' into feat/sana-video
f3c87f4 (github-actions[bot]): Apply style fixes
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,36 @@ | ||
| <!-- Copyright 2025 The SANA-Video Authors and HuggingFace Team. All rights reserved. | ||
| Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with | ||
| the License. You may obtain a copy of the License at | ||
| http://www.apache.org/licenses/LICENSE-2.0 | ||
| Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on | ||
| an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the | ||
| specific language governing permissions and limitations under the License. --> | ||
|
|
||
| # SanaVideoTransformer3DModel | ||
|
|
||
| A Diffusion Transformer model for 3D data (video) from [SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer](https://huggingface.co/papers/2509.24695) from NVIDIA and MIT HAN Lab, by Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, Enze Xie. | ||
|
|
||
| The abstract from the paper is: | ||
|
|
||
| *We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality and long videos with strong text-video alignment at a remarkably fast speed, deployable on RTX 5090 GPU. Two core designs ensure our efficient, effective and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: we design block-wise autoregressive approach for long video generation by employing a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, narrowing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the inference speed of generating a 5-second 720p video from 71s to 29s (2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.* | ||
|
|
||
| The model can be loaded with the following code snippet. | ||
|
|
||
| ```python | ||
| from diffusers import SanaVideoTransformer3DModel | ||
| import torch | ||
|
|
||
| transformer = SanaVideoTransformer3DModel.from_pretrained("Efficient-Large-Model/SANA-Video_2B_480p_diffusers", subfolder="transformer", torch_dtype=torch.bfloat16) | ||
| ``` | ||
|
|
||
| ## SanaVideoTransformer3DModel | ||
|
|
||
| [[autodoc]] SanaVideoTransformer3DModel | ||
|
|
||
| ## Transformer2DModelOutput | ||
|
|
||
| [[autodoc]] models.modeling_outputs.Transformer2DModelOutput | ||
|
|
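
The abstract quoted in this file describes a constant-memory state derived from the cumulative property of linear attention. The following is a toy sketch of that idea in plain Python, not the SANA-Video implementation: the feature map `phi` (ReLU plus a small constant), the block-wise visibility rule, and all function names are assumptions made for illustration. It carries only a fixed-size running state across blocks and checks the result against a quadratic reference that revisits every earlier token.

```python
import random


def phi(x):
    # Assumed feature map: ReLU plus a small constant so normalizers stay positive.
    return [max(value, 0.0) + 1e-6 for value in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def blockwise_linear_attention(q, k, v, block_size):
    """Process tokens block by block while carrying only a fixed-size state."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    norm = [0.0] * d_k                         # running sum of phi(k)
    out = []
    for start in range(0, len(q), block_size):
        end = min(start + block_size, len(q))
        # Fold the current block's keys and values into the running state.
        for t in range(start, end):
            fk = phi(k[t])
            for i in range(d_k):
                norm[i] += fk[i]
                for j in range(d_v):
                    state[i][j] += fk[i] * v[t][j]
        # Queries in this block read the state; no per-token cache is kept.
        for t in range(start, end):
            fq = phi(q[t])
            denom = dot(fq, norm)
            out.append(
                [sum(fq[i] * state[i][j] for i in range(d_k)) / denom for j in range(d_v)]
            )
    return out


def reference_attention(q, k, v, block_size):
    """Quadratic reference: each query sees every token up to the end of its block."""
    d_v = len(v[0])
    out = []
    for t in range(len(q)):
        visible = min((t // block_size + 1) * block_size, len(q))
        fq = phi(q[t])
        weights = [dot(fq, phi(k[s])) for s in range(visible)]
        total = sum(weights)
        out.append(
            [sum(w * v[s][j] for s, w in enumerate(weights)) / total for j in range(d_v)]
        )
    return out


random.seed(0)
n, d = 12, 4
q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
k = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
v = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

fast = blockwise_linear_attention(q, k, v, block_size=4)
slow = reference_attention(q, k, v, block_size=4)
max_err = max(abs(a - b) for row_a, row_b in zip(fast, slow) for a, b in zip(row_a, row_b))
print(f"max abs difference: {max_err:.2e}")
```

The state holds `d_k * d_v + d_k` numbers regardless of how many tokens have been processed, which is the property the abstract relies on for long videos.
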
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,102 @@ | ||
| <!-- Copyright 2025 The SANA-Video Authors and HuggingFace Team. All rights reserved. | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. --> | ||
|
|
||
| # SanaVideoPipeline | ||
|
|
||
| <div class="flex flex-wrap space-x-1"> | ||
| <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/> | ||
| <img alt="MPS" src="https://img.shields.io/badge/MPS-000000?style=flat&logo=apple&logoColor=white"/> | ||
| </div> | ||
|
|
||
| [SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer](https://huggingface.co/papers/2509.24695) from NVIDIA and MIT HAN Lab, by Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, Enze Xie. | ||
|
|
||
| The abstract from the paper is: | ||
|
|
||
| *We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality and long videos with strong text-video alignment at a remarkably fast speed, deployable on RTX 5090 GPU. Two core designs ensure our efficient, effective and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: we design block-wise autoregressive approach for long video generation by employing a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, narrowing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the inference speed of generating a 5-second 720p video from 71s to 29s (2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation. [this https URL](https://github.com/NVlabs/SANA).* | ||
|
|
||
| This pipeline was contributed by the SANA Team. The original codebase can be found [here](https://github.com/NVlabs/Sana). The original weights can be found under [hf.co/Efficient-Large-Model](https://hf.co/collections/Efficient-Large-Model/sana-video). | ||
|
|
||
| Available models: | ||
|
|
||
| | Model | Recommended dtype | | ||
| |:-----:|:-----------------:| | ||
| [`Efficient-Large-Model/SANA-Video_2B_480p_diffusers`](https://huggingface.co/Efficient-Large-Model/SANA-Video_2B_480p_diffusers) | `torch.bfloat16` | | ||
|
|
||
| Refer to [this](https://huggingface.co/collections/Efficient-Large-Model/sana-video) collection for more information. | ||
|
|
||
| Note: The recommended dtype mentioned is for the transformer weights. The text encoder and VAE weights must stay in `torch.bfloat16` or `torch.float32` for the model to work correctly. Please refer to the inference example below to see how to load the model with the recommended dtype. | ||
|
|
||
| ## Quantization | ||
|
|
||
| Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model. | ||
|
|
||
| Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`SanaVideoPipeline`] for inference with bitsandbytes. | ||
|
|
||
| ```py | ||
| import torch | ||
| from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, SanaVideoTransformer3DModel, SanaVideoPipeline | ||
| from diffusers.utils import export_to_video | ||
| from transformers import BitsAndBytesConfig, AutoModel | ||
|
|
||
| quant_config = BitsAndBytesConfig(load_in_8bit=True) | ||
| text_encoder_8bit = AutoModel.from_pretrained( | ||
| "Efficient-Large-Model/SANA-Video_2B_480p_diffusers", | ||
| subfolder="text_encoder", | ||
| quantization_config=quant_config, | ||
| torch_dtype=torch.float16, | ||
| ) | ||
|
|
||
| quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True) | ||
| transformer_8bit = SanaVideoTransformer3DModel.from_pretrained( | ||
| "Efficient-Large-Model/SANA-Video_2B_480p_diffusers", | ||
| subfolder="transformer", | ||
| quantization_config=quant_config, | ||
| torch_dtype=torch.float16, | ||
| ) | ||
|
|
||
| pipeline = SanaVideoPipeline.from_pretrained( | ||
| "Efficient-Large-Model/SANA-Video_2B_480p_diffusers", | ||
| text_encoder=text_encoder_8bit, | ||
| transformer=transformer_8bit, | ||
| torch_dtype=torch.float16, | ||
| device_map="balanced", | ||
| ) | ||
|
|
||
| motion_score = 30 | ||
| prompt = "Evening, backlight, side lighting, soft light, high contrast, mid-shot, centered composition, clean solo shot, warm color. A young Caucasian man stands in a forest, golden light glimmers on his hair as sunlight filters through the leaves. He wears a light shirt, wind gently blowing his hair and collar, light dances across his face with his movements. The background is blurred, with dappled light and soft tree shadows in the distance. The camera focuses on his lifted gaze, clear and emotional." | ||
| negative_prompt = "A chaotic sequence with misshapen, deformed limbs in heavy motion blur, sudden disappearance, jump cuts, jerky movements, rapid shot changes, frames out of sync, inconsistent character shapes, temporal artifacts, jitter, and ghosting effects, creating a disorienting visual experience." | ||
| motion_prompt = f" motion score: {motion_score}." | ||
| prompt = prompt + motion_prompt | ||
|
|
||
| output = pipeline( | ||
| prompt=prompt, | ||
| negative_prompt=negative_prompt, | ||
| height=480, | ||
| width=832, | ||
| num_frames=81, | ||
| guidance_scale=6.0, | ||
| num_inference_steps=50 | ||
| ).frames[0] | ||
| export_to_video(output, "sana-video-output.mp4", fps=16) | ||
| ``` | ||
|
|
||
| ## SanaVideoPipeline | ||
|
|
||
| [[autodoc]] SanaVideoPipeline | ||
| - all | ||
| - __call__ | ||
|
|
||
|
|
||
| ## SanaVideoPipelineOutput | ||
|
|
||
| [[autodoc]] pipelines.sana.pipeline_sana_video.SanaVideoPipelineOutput |
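
The inference example in this file passes `num_frames=81`, exports at `fps=16`, and appends a `motion score` suffix to the prompt. A small sketch of the arithmetic and string handling behind those choices follows. The helper names `clip_duration_seconds` and `with_motion_score` are hypothetical and are not part of diffusers.

```python
def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Length of the exported clip when every generated frame is written at `fps`."""
    return num_frames / fps


def with_motion_score(prompt: str, motion_score: int) -> str:
    """Append the motion-score suffix in the same format the example uses."""
    return f"{prompt} motion score: {motion_score}."


duration = clip_duration_seconds(num_frames=81, fps=16)
tagged_prompt = with_motion_score("A cat walks.", 30)

print(duration)       # 5.0625
print(tagged_prompt)  # A cat walks. motion score: 30.
```

With the values from the example, 81 frames at 16 fps gives a clip of just over five seconds, which matches the "5s" in the PR title.
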
This will cause an error when building the docs since the `api/models/sana_video_transformer3d` file doesn't currently exist. Could you add a markdown doc for the transformer as well? For reference, here is the documentation for `SanaTransformer2DModel`: https://github.com/huggingface/diffusers/blob/main/docs/source/en/api/models/sana_transformer2d.md
Done.
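
The failure flagged in this thread, a toctree entry with no matching page, can be caught locally before the docs build runs. Below is a minimal sketch under stated assumptions: the toctree is assumed to list pages as `local: <path>` lines, each page is assumed to live at `<docs_root>/<path>.md`, and `missing_toctree_pages` is a hypothetical helper rather than part of the docs tooling. The demo uses a temporary directory, not the real repository.

```python
import re
import tempfile
from pathlib import Path


def missing_toctree_pages(toctree_text: str, docs_root: Path) -> list[str]:
    """Return every `local:` entry that has no matching .md file under docs_root."""
    entries = re.findall(r"^\s*-?\s*local:\s*(\S+)\s*$", toctree_text, flags=re.MULTILINE)
    return [entry for entry in entries if not (docs_root / f"{entry}.md").is_file()]


# Illustrative toctree fragment; the layout is an assumption, not copied from the PR.
toctree = """
- sections:
  - local: api/models/sana_transformer2d
    title: SanaTransformer2DModel
  - local: api/models/sana_video_transformer3d
    title: SanaVideoTransformer3DModel
"""

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "api/models").mkdir(parents=True)
    # Only the 2D transformer page exists, mirroring the state the reviewer described.
    (root / "api/models/sana_transformer2d.md").write_text("# SanaTransformer2DModel\n")
    missing = missing_toctree_pages(toctree, root)

print(missing)  # ['api/models/sana_video_transformer3d']
```
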