@JasonOE JasonOE commented Nov 5, 2025

Refactor Pipeline Validation and Improve Node State Management

Summary

This PR refactors the pipeline validation logic, improves node state management, and enhances error handling in the scheduler. The main changes include:

  1. Unified pipeline validation: Consolidates has_full_active_pipeline() into has_full_pipeline() with an active_only parameter
  2. Improved pipeline detection algorithm: Replaces simple layer counting with a DFS-based approach that correctly validates contiguous pipeline chains
  3. Better node state management: Ensures is_active flag is properly set when nodes are deallocated
  4. Enhanced error handling: Adds proper logging and state updates for global rebalancing operations
  5. Code cleanup: Removes unnecessary debug logging statements

Changes

Core Improvements

1. Unified Pipeline Validation (layer_allocation.py)

  • Before: Two separate methods has_full_pipeline() and has_full_active_pipeline() with different implementations
  • After: Single has_full_pipeline(active_only: bool = False) method that handles both cases
  • Algorithm improvement: Replaced simple layer counting with DFS-based validation that correctly checks for contiguous node chains from layer 0 to num_total_layers

The new implementation:

  • Builds an index of nodes by start_layer
  • Uses DFS to verify a continuous path exists from layer 0 to the end
  • Properly filters inactive nodes when active_only=True
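
A minimal sketch of what such a DFS-based check might look like, assuming each node exposes `start_layer`, `end_layer` (exclusive), and `is_active` (the attribute names are illustrative; the actual fields in layer_allocation.py may differ):

```python
from dataclasses import dataclass


@dataclass
class Node:
    start_layer: int
    end_layer: int  # exclusive
    is_active: bool = True


def has_full_pipeline(nodes, num_total_layers, active_only=False):
    """Check whether some contiguous chain of nodes covers [0, num_total_layers)."""
    # Index candidate nodes by their start layer.
    by_start = {}
    for node in nodes:
        if active_only and not node.is_active:
            continue
        by_start.setdefault(node.start_layer, []).append(node)

    # DFS from layer 0, following end_layer -> start_layer links.
    stack, seen = [0], set()
    while stack:
        layer = stack.pop()
        if layer == num_total_layers:
            return True
        if layer in seen:
            continue
        seen.add(layer)
        for node in by_start.get(layer, []):
            stack.append(node.end_layer)
    return False
```

Unlike layer counting, this accepts overlapping or redundant nodes as long as one contiguous chain reaches the final layer, and rejects layouts whose total layer count matches but leaves a gap.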

2. Node State Management (layer_allocation.py)

  • Added node.is_active = False in deallocate() to ensure nodes are marked inactive when deallocated
  • This prevents inactive nodes from being considered in pipeline validation
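
The state fix can be illustrated with a hypothetical slice of the allocator (class and method shapes are assumptions, not the PR's exact code):

```python
class Node:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.is_active = False


class LayerAllocator:
    """Illustrative sketch of the allocate/deallocate lifecycle."""

    def __init__(self):
        self.nodes: dict[str, Node] = {}

    def allocate(self, node: Node) -> None:
        node.is_active = True
        self.nodes[node.node_id] = node

    def deallocate(self, node_id: str) -> None:
        node = self.nodes.pop(node_id, None)
        if node is not None:
            # The fix: mark the node inactive so that a later
            # has_full_pipeline(active_only=True) check no longer counts it,
            # even if a stale reference to the node is still held elsewhere.
            node.is_active = False
```
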

3. Enhanced Error Handling (scheduler.py)

  • Added proper success/failure logging for global rebalancing operations
  • Ensures _bootstrapped flag and _bootstrapped_event are correctly updated after rebalancing
  • Provides better visibility into scheduler state transitions
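
A sketch of the rebalancing path in the scheduler, assuming an `asyncio.Event` for bootstrap signalling (the method and attribute names besides `_bootstrapped` and `_bootstrapped_event` are illustrative):

```python
import asyncio
import logging

logger = logging.getLogger("scheduler")


class Scheduler:
    def __init__(self, allocator):
        self.allocator = allocator
        self._bootstrapped = False
        self._bootstrapped_event = asyncio.Event()

    def _global_rebalance(self) -> bool:
        success = self.allocator.global_rebalance()
        if success:
            logger.info("Global rebalancing succeeded; full pipeline restored.")
            self._bootstrapped = True
            self._bootstrapped_event.set()
        else:
            logger.warning("Global rebalancing failed; waiting for more nodes.")
            self._bootstrapped = False
            self._bootstrapped_event.clear()
        return success
```

Keeping the flag and the event in sync in both branches is the point of the change: callers blocked on `_bootstrapped_event.wait()` only wake up when a full pipeline actually exists.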

4. API Updates (scheduler_manage.py)

  • Updated get_schedule_status() to use the new unified has_full_pipeline(active_only=True) method
  • Maintains the same functionality while using the improved implementation
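
The call-site change amounts to something like the following (the status payload shown is illustrative; the real endpoint likely returns more fields):

```python
def get_schedule_status(allocator) -> dict:
    """Report whether the currently active nodes form a complete pipeline."""
    return {
        # Previously: allocator.has_full_active_pipeline()
        "has_full_active_pipeline": allocator.has_full_pipeline(active_only=True),
    }
```
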

Code Cleanup

  • Removed debug logging statements from:
    • model_info.py: Removed verbose logging in decoder_layer_io_bytes()
    • node.py: Removed debug logging in get_decoder_layer_capacity()

Benefits

  1. Correctness: The DFS-based pipeline validation correctly identifies contiguous pipelines, fixing potential edge cases with the previous layer-counting approach
  2. Maintainability: Unified API reduces code duplication and makes the codebase easier to maintain
  3. Reliability: Proper node state management ensures inactive nodes don't interfere with pipeline validation
  4. Observability: Better logging helps diagnose scheduler state issues

Testing

  • Existing tests should continue to pass with the improved pipeline validation logic
  • The DFS-based approach is more robust and handles edge cases better than the previous implementation

Migration Notes

  • No breaking changes for external APIs
  • Internal callers of has_full_active_pipeline() should use has_full_pipeline(active_only=True) instead
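
If a transitional shim is wanted for internal callers that have not yet migrated, it could look like this (not part of the PR; the deprecated wrapper and the placeholder body are assumptions):

```python
import warnings


class LayerAllocator:
    def has_full_pipeline(self, active_only: bool = False) -> bool:
        # Stand-in for the real DFS-based validation in layer_allocation.py.
        return True

    def has_full_active_pipeline(self) -> bool:
        """Deprecated: use has_full_pipeline(active_only=True) instead."""
        warnings.warn(
            "has_full_active_pipeline() is deprecated; "
            "use has_full_pipeline(active_only=True)",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.has_full_pipeline(active_only=True)
```
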

@christ-tt christ-tt left a comment

LGTM after minor fix. Good work and thanks a lot!

@JasonOE JasonOE requested a review from a team November 6, 2025 06:16
@christ-tt

I think we can merge this for now.
As a TODO, let's consider this case: suppose we have 3 identical nodes, our model has 28 layers, each node can host 22 layers. After bootstrapping, we will have node 1: [0, 14), node 2: [14, 28), and node 3, currently idling, hosting [0, 22). Now, suppose node 1 left, we now have node 2 and node 3; here we don't necessarily need global rebalance - node 2 can stay unchanged, and we only need to trim the layer hosted on node 3 ([0, 22) -> [0, 14)).
We have very similar logic implemented in request routing phase. I think we can try to implement a separate function here in layer_allocation, checking if trimming any nodes makes sense, and put this before should_global_rebalance check. Especially for active but idling (not forming any actual pipeline, or serving any request) nodes.
But for now, I think current implementation is good enough. We can open another PR handling this optimization.
Thanks.
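
One way the trim check suggested in the comment above might look; the function name and the restriction to a gap at the front of the chain are assumptions, matching only the 3-node scenario described (a real helper would need to handle gaps anywhere in the chain):

```python
from dataclasses import dataclass


@dataclass
class Node:
    start_layer: int
    end_layer: int  # exclusive


def find_trim_candidate(serving, idle, num_total_layers):
    """Look for an idle node that can be trimmed to fill the gap in front
    of the serving chain, avoiding a global rebalance.

    Assumes the serving nodes already form a contiguous chain ending at
    num_total_layers.
    """
    if not serving:
        return None
    # The gap to fill runs from layer 0 up to the first serving node.
    gap_end = min(n.start_layer for n in serving)
    if gap_end == 0:
        return None  # no gap: the pipeline is already complete
    for node in idle:
        # Trimming works if the idle node already hosts [0, gap_end).
        if node.start_layer == 0 and node.end_layer >= gap_end:
            return node, (0, gap_end)
    return None
```

In the example above, with node 2 serving [14, 28) and node 3 idle on [0, 22), this would pick node 3 and trim it to [0, 14).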

@JasonOE JasonOE merged commit 6c4560f into main Nov 7, 2025
4 checks passed
@JasonOE JasonOE deleted the mini-pr branch November 7, 2025 08:11