Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
39 commits
Select commit Hold shift + click to select a range
13e516c
1. add `SanaVideoTransformer3DModel` in transformer_sana_video.py
lawrence-cj Sep 9, 2025
5eb5354
add a sample about how to use sana-video;
lawrence-cj Sep 9, 2025
c6d7876
code update;
lawrence-cj Sep 11, 2025
d67ab2a
update hf model path;
lawrence-cj Sep 11, 2025
a5f19e0
update code;
lawrence-cj Oct 3, 2025
c15ae23
sana-video can run now;
lawrence-cj Nov 4, 2025
ee79af3
1. add aspect ratio in sana-video-pipeline;
lawrence-cj Nov 4, 2025
f06a93d
Merge branch 'main' into feat/sana-video
lawrence-cj Nov 4, 2025
49557c1
default to use `use_resolution_binning`;
lawrence-cj Nov 4, 2025
857ca30
make style;
lawrence-cj Nov 4, 2025
3ed7000
remove unused code;
lawrence-cj Nov 4, 2025
439bf58
Update src/diffusers/models/transformers/transformer_sana_video.py
lawrence-cj Nov 5, 2025
de4cf31
Update src/diffusers/models/transformers/transformer_sana_video.py
lawrence-cj Nov 5, 2025
fe73287
Update src/diffusers/models/transformers/transformer_sana_video.py
lawrence-cj Nov 5, 2025
118677a
Update src/diffusers/pipelines/sana/pipeline_sana_video.py
lawrence-cj Nov 5, 2025
3546c44
Update src/diffusers/models/transformers/transformer_sana_video.py
lawrence-cj Nov 5, 2025
f845bba
Update src/diffusers/models/transformers/transformer_sana_video.py
lawrence-cj Nov 5, 2025
77714ba
Update src/diffusers/models/transformers/transformer_sana_video.py
lawrence-cj Nov 5, 2025
b536cfd
Update src/diffusers/pipelines/sana/pipeline_sana_video.py
lawrence-cj Nov 5, 2025
b0f4866
Update src/diffusers/models/transformers/transformer_sana_video.py
lawrence-cj Nov 5, 2025
7205eee
Update src/diffusers/pipelines/sana/pipeline_sana_video.py
lawrence-cj Nov 5, 2025
e26a35d
Merge branch 'main' into feat/sana-video
lawrence-cj Nov 5, 2025
fd5cff2
support `dispatch_attention_fn`
lawrence-cj Nov 5, 2025
f2a9d0b
1. add sana-video markdown;
lawrence-cj Nov 5, 2025
d98f93c
add two test case for sana-video (need check)
lawrence-cj Nov 5, 2025
4569d0b
fix text-encoder in test-sana-video;
lawrence-cj Nov 5, 2025
1379391
Update tests/pipelines/sana/test_sana_video.py
lawrence-cj Nov 5, 2025
b359240
Update tests/pipelines/sana/test_sana_video.py
lawrence-cj Nov 5, 2025
7256023
Update tests/pipelines/sana/test_sana_video.py
lawrence-cj Nov 5, 2025
25d1a4c
Update tests/pipelines/sana/test_sana_video.py
lawrence-cj Nov 5, 2025
a9c16eb
Update tests/pipelines/sana/test_sana_video.py
lawrence-cj Nov 5, 2025
8a27d58
Update tests/pipelines/sana/test_sana_video.py
lawrence-cj Nov 5, 2025
4c25427
Update src/diffusers/pipelines/sana/pipeline_sana_video.py
lawrence-cj Nov 5, 2025
31c9fa5
Update src/diffusers/video_processor.py
lawrence-cj Nov 5, 2025
0ed7eee
make style
lawrence-cj Nov 5, 2025
e31f91b
toctree yaml update;
lawrence-cj Nov 5, 2025
cb31fc2
add sana-video-transformer3d markdown;
lawrence-cj Nov 6, 2025
2b8c3e3
Merge branch 'main' into feat/sana-video
lawrence-cj Nov 6, 2025
f3c87f4
Apply style fixes
github-actions[bot] Nov 6, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions docs/source/en/_toctree.yml
Original file line number Diff line number Diff line change
Expand Up @@ -373,6 +373,8 @@
title: QwenImageTransformer2DModel
- local: api/models/sana_transformer2d
title: SanaTransformer2DModel
- local: api/models/sana_video_transformer3d
title: SanaVideoTransformer3DModel
Comment on lines +376 to +377
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will cause an error when building the docs since the api/models/sana_video_transformer3d file doesn't currently exist. Could you add a markdown doc for the transformer as well? For reference, here is the documentation for SanaTransformer2DModel: https://github.com/huggingface/diffusers/blob/main/docs/source/en/api/models/sana_transformer2d.md

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

- local: api/models/sd3_transformer2d
title: SD3Transformer2DModel
- local: api/models/skyreels_v2_transformer_3d
Expand Down Expand Up @@ -563,6 +565,8 @@
title: Sana
- local: api/pipelines/sana_sprint
title: Sana Sprint
- local: api/pipelines/sana_video
title: Sana Video
- local: api/pipelines/self_attention_guidance
title: Self-Attention Guidance
- local: api/pipelines/semantic_stable_diffusion
Expand Down
36 changes: 36 additions & 0 deletions docs/source/en/api/models/sana_video_transformer3d.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
<!-- Copyright 2025 The SANA-Video Authors and HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# SanaVideoTransformer3DModel

A Diffusion Transformer model for 3D data (video) from [SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer](https://huggingface.co/papers/2509.24695) from NVIDIA and MIT HAN Lab, by Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, Enze Xie.

The abstract from the paper is:

*We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality and long videos with strong text-video alignment at a remarkably fast speed, deployable on RTX 5090 GPU. Two core designs ensure our efficient, effective and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: we design block-wise autoregressive approach for long video generation by employing a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, narrowing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the inference speed of generating a 5-second 720p video from 71s to 29s (2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.*

The model can be loaded with the following code snippet.

```python
from diffusers import SanaVideoTransformer3DModel
import torch

transformer = SanaVideoTransformer3DModel.from_pretrained("Efficient-Large-Model/SANA-Video_2B_480p_diffusers", subfolder="transformer", torch_dtype=torch.bfloat16)
```

## SanaVideoTransformer3DModel

[[autodoc]] SanaVideoTransformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput

3 changes: 0 additions & 3 deletions docs/source/en/api/pipelines/sana_sprint.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,9 +24,6 @@ The abstract from the paper is:

*This paper presents SANA-Sprint, an efficient diffusion model for ultra-fast text-to-image (T2I) generation. SANA-Sprint is built on a pre-trained foundation model and augmented with hybrid distillation, dramatically reducing inference steps from 20 to 1-4. We introduce three key innovations: (1) We propose a training-free approach that transforms a pre-trained flow-matching model for continuous-time consistency distillation (sCM), eliminating costly training from scratch and achieving high training efficiency. Our hybrid distillation strategy combines sCM with latent adversarial distillation (LADD): sCM ensures alignment with the teacher model, while LADD enhances single-step generation fidelity. (2) SANA-Sprint is a unified step-adaptive model that achieves high-quality generation in 1-4 steps, eliminating step-specific training and improving efficiency. (3) We integrate ControlNet with SANA-Sprint for real-time interactive image generation, enabling instant visual feedback for user interaction. SANA-Sprint establishes a new Pareto frontier in speed-quality tradeoffs, achieving state-of-the-art performance with 7.59 FID and 0.74 GenEval in only 1 step — outperforming FLUX-schnell (7.94 FID / 0.71 GenEval) while being 10× faster (0.1s vs 1.1s on H100). It also achieves 0.1s (T2I) and 0.25s (ControlNet) latency for 1024×1024 images on H100, and 0.31s (T2I) on an RTX 4090, showcasing its exceptional efficiency and potential for AI-powered consumer applications (AIPC). Code and pre-trained models will be open-sourced.*

> [!TIP]
> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
This pipeline was contributed by [lawrence-cj](https://github.com/lawrence-cj), [shuchen Xue](https://github.com/scxue) and [Enze Xie](https://github.com/xieenze). The original codebase can be found [here](https://github.com/NVlabs/Sana). The original weights can be found under [hf.co/Efficient-Large-Model](https://huggingface.co/Efficient-Large-Model/).

Available models:
Expand Down
102 changes: 102 additions & 0 deletions docs/source/en/api/pipelines/sana_video.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,102 @@
<!-- Copyright 2025 The SANA-Video Authors and HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License. -->

# SanaVideoPipeline

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
<img alt="MPS" src="https://img.shields.io/badge/MPS-000000?style=flat&logo=apple&logoColor=white%22">
</div>

[SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer](https://huggingface.co/papers/2509.24695) from NVIDIA and MIT HAN Lab, by Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, Enze Xie.

The abstract from the paper is:

*We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality and long videos with strong text-video alignment at a remarkably fast speed, deployable on RTX 5090 GPU. Two core designs ensure our efficient, effective and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: we design block-wise autoregressive approach for long video generation by employing a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, narrowing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the inference speed of generating a 5-second 720p video from 71s to 29s (2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation. [this https URL](https://github.com/NVlabs/SANA).*

This pipeline was contributed by SANA Team. The original codebase can be found [here](https://github.com/NVlabs/Sana). The original weights can be found under [hf.co/Efficient-Large-Model](https://hf.co/collections/Efficient-Large-Model/sana-video).

Available models:

| Model | Recommended dtype |
|:-----:|:-----------------:|
| [`Efficient-Large-Model/SANA-Video_2B_480p_diffusers`](https://huggingface.co/Efficient-Large-Model/ANA-Video_2B_480p_diffusers) | `torch.bfloat16` |

Refer to [this](https://huggingface.co/collections/Efficient-Large-Model/sana-video) collection for more information.

Note: The recommended dtype mentioned is for the transformer weights. The text encoder and VAE weights must stay in `torch.bfloat16` or `torch.float32` for the model to work correctly. Please refer to the inference example below to see how to load the model with the recommended dtype.

## Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.

Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`SanaVideoPipeline`] for inference with bitsandbytes.

```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, SanaVideoTransformer3DModel, SanaVideoPipeline
from transformers import BitsAndBytesConfig as BitsAndBytesConfig, AutoModel

quant_config = BitsAndBytesConfig(load_in_8bit=True)
text_encoder_8bit = AutoModel.from_pretrained(
"Efficient-Large-Model/SANA-Video_2B_480p_diffusers",
subfolder="text_encoder",
quantization_config=quant_config,
torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = SanaVideoTransformer3DModel.from_pretrained(
"Efficient-Large-Model/SANA-Video_2B_480p_diffusers",
subfolder="transformer",
quantization_config=quant_config,
torch_dtype=torch.float16,
)

pipeline = SanaVideoPipeline.from_pretrained(
"Efficient-Large-Model/SANA-Video_2B_480p_diffusers",
text_encoder=text_encoder_8bit,
transformer=transformer_8bit,
torch_dtype=torch.float16,
device_map="balanced",
)

model_score = 30
prompt = "Evening, backlight, side lighting, soft light, high contrast, mid-shot, centered composition, clean solo shot, warm color. A young Caucasian man stands in a forest, golden light glimmers on his hair as sunlight filters through the leaves. He wears a light shirt, wind gently blowing his hair and collar, light dances across his face with his movements. The background is blurred, with dappled light and soft tree shadows in the distance. The camera focuses on his lifted gaze, clear and emotional."
negative_prompt = "A chaotic sequence with misshapen, deformed limbs in heavy motion blur, sudden disappearance, jump cuts, jerky movements, rapid shot changes, frames out of sync, inconsistent character shapes, temporal artifacts, jitter, and ghosting effects, creating a disorienting visual experience."
motion_prompt = f" motion score: {model_score}."
prompt = prompt + motion_prompt

output = pipeline(
prompt=prompt,
negative_prompt=negative_prompt,
height=480,
width=832,
num_frames=81,
guidance_scale=6.0,
num_inference_steps=50
).frames[0]
export_to_video(output, "sana-video-output.mp4", fps=16)
```

## SanaVideoPipeline

[[autodoc]] SanaVideoPipeline
- all
- __call__


## SanaVideoPipelineOutput

[[autodoc]] pipelines.sana.pipeline_sana_video.SanaVideoPipelineOutput
Loading
Loading