add ChronoEdit #12593
Conversation
from ..modeling_outputs import Transformer2DModelOutput
from ..modeling_utils import ModelMixin
from ..normalization import FP32LayerNorm
from .transformer_wan import WanTimeTextImageEmbedding, WanTransformerBlock
can we copy these two classes over and add a `# Copied from` comment, instead of importing from wan?
yep, that makes sense. so we'll need to copy all the modules from transformer_wan here.
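For reference, the diffusers convention this refers to looks roughly like the sketch below, assuming the copies keep Wan's implementations verbatim; the ChronoEdit-prefixed class names are illustrative, not the PR's actual names:

```python
import torch.nn as nn

# `make fix-copies` keeps these in sync with the originals in transformer_wan.

# Copied from diffusers.models.transformers.transformer_wan.WanTimeTextImageEmbedding
class ChronoEditTimeTextImageEmbedding(nn.Module):
    ...


# Copied from diffusers.models.transformers.transformer_wan.WanTransformerBlock
class ChronoEditTransformerBlock(nn.Module):
    ...
```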
thanks for the PR! I left one question about whether we support any value of num_frames.
other than that, I think we should remove the logic that's in wan but not needed here for ChronoEdit, to simplify the code a bit; but if you want to keep it consistent with wan and may support these features in the future, that's ok too.
self.video_processor = VideoProcessor(vae_scale_factor=self.vae_scale_factor_spatial)
self.image_processor = image_processor

def _get_t5_prompt_embeds(
let's add a `# Copied from` if it's the same one as Wan
    return prompt_embeds

def encode_image(
same here
image_encoder: CLIPVisionModel = None,
transformer: ChronoEditTransformer3DModel = None,
transformer_2: ChronoEditTransformer3DModel = None,
boundary_ratio: Optional[float] = None,
boundary_ratio: Optional[float] = None,
if we don't support the two-stage denoising loop, let's remove this parameter and all its related logic, to simplify the pipeline a bit
num_frames: int = 81,
num_inference_steps: int = 50,
guidance_scale: float = 5.0,
guidance_scale_2: Optional[float] = None,
guidance_scale_2: Optional[float] = None,
prompt_embeds: Optional[torch.Tensor] = None,
negative_prompt_embeds: Optional[torch.Tensor] = None,
image_embeds: Optional[torch.Tensor] = None,
last_image: Optional[torch.Tensor] = None,
it's an image editing task that can also output a video to show the reasoning process, no? what would be a meaningful use case for also passing a last_image parameter here?
if self.config.boundary_ratio is not None and image_embeds is not None:
    raise ValueError("Cannot forward `image_embeds` when the pipeline's `boundary_ratio` is configured.")

def prepare_latents(
i think this is the same as in wan i2v too?
if you want to just add a `# Copied from` and keep this method as it is, that's fine! we can also just remove all the logic we don't need here related to `last_image` and `expand_timesteps`
freqs_cos = self.freqs_cos.split(split_sizes, dim=1)
freqs_sin = self.freqs_sin.split(split_sizes, dim=1)

assert num_frames == 2 or num_frames == self.temporal_skip_len, (
i don't understand this check here. I think after the temporal reasoning step num_frames is 2, but otherwise, e.g. if temporal reasoning is not enabled, this dimension will have various lengths based on the num_frames value the user passed to the pipeline, no?
if our model can only work with a fixed num_frames, maybe we should raise an error from the pipeline when we check the inputs?
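For illustration, such a check could live in the pipeline's `check_inputs`; a rough sketch that mirrors the assert above, where the `temporal_skip_len` config attribute is an assumption rather than the PR's confirmed API:

```python
def check_inputs(self, num_frames: int) -> None:
    # Hypothetical sketch: surface the transformer's fixed-frame constraint as a
    # pipeline-level error instead of an assert deep inside the forward pass.
    allowed = (2, self.transformer.config.temporal_skip_len)  # assumed attribute
    if num_frames not in allowed:
        raise ValueError(
            f"This checkpoint only supports `num_frames` in {allowed}, got {num_frames}."
        )
```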
add ChronoEdit
This PR adds ChronoEdit, a state-of-the-art image editing model that reframes image editing as a video generation task to achieve physically consistent edits.
HF Model: https://huggingface.co/nvidia/ChronoEdit-14B-Diffusers
Gradio Demo: https://huggingface.co/spaces/nvidia/ChronoEdit
Paper: https://arxiv.org/abs/2510.04290
Code: https://github.com/nv-tlabs/ChronoEdit
Website: https://research.nvidia.com/labs/toronto-ai/chronoedit/
cc: @sayakpaul @yiyixuxu @asomoza
Usage
Full model
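A minimal sketch of plain editing, assuming the pipeline class added in this PR is `ChronoEditPipeline` and that it follows the usual diffusers video-pipeline API; argument values here are illustrative, not confirmed defaults:

```python
import torch
from diffusers import ChronoEditPipeline
from diffusers.utils import load_image

pipe = ChronoEditPipeline.from_pretrained(
    "nvidia/ChronoEdit-14B-Diffusers", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

image = load_image("input.png")  # the image to edit
prompt = "add a red scarf to the cat"

# ChronoEdit frames editing as video generation: the edited image is the
# final frame of the generated clip.
output = pipe(image=image, prompt=prompt, num_inference_steps=50, guidance_scale=5.0)
edited = output.frames[0][-1]
```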
Full model with temporal reasoning
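Presumably the same call with the reasoning path switched on; the flag name below is a guess based on the discussion above, not the PR's confirmed signature:

```python
# Hypothetical flag: also generate the intermediate frames that visualize
# how the model "reasons" its way from the input to the edited result.
output = pipe(
    image=image,
    prompt=prompt,
    enable_temporal_reasoning=True,  # assumed parameter name
    num_inference_steps=50,
    guidance_scale=5.0,
)
reasoning_video = output.frames[0]  # full trajectory
edited = reasoning_video[-1]        # final edited frame
```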
With 8-step distillation LoRA
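And loading the distillation LoRA through the standard diffusers LoRA API; the weight location and filename are placeholders, so check the model card for the actual names:

```python
# Load and fuse an 8-step distillation LoRA (repo/filename are placeholders).
pipe.load_lora_weights(
    "nvidia/ChronoEdit-14B-Diffusers",
    weight_name="chronoedit_distill_lora.safetensors",  # assumed filename
)
pipe.fuse_lora(lora_scale=1.0)

# Distilled checkpoints typically run with few steps and low guidance.
output = pipe(image=image, prompt=prompt, num_inference_steps=8, guidance_scale=1.0)
edited = output.frames[0][-1]
```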