@JasonOE JasonOE commented Nov 5, 2025

Refactor Pipeline Validation and Improve Node State Management

Summary

This PR refactors the pipeline validation logic, improves node state management, and enhances error handling in the scheduler. The main changes include:

  1. Unified pipeline validation: Consolidates has_full_active_pipeline() into has_full_pipeline() with an active_only parameter
  2. Improved pipeline detection algorithm: Replaces simple layer counting with a DFS-based approach that correctly validates contiguous pipeline chains
  3. Better node state management: Ensures is_active flag is properly set when nodes are deallocated
  4. Enhanced error handling: Adds proper logging and state updates for global rebalancing operations
  5. Code cleanup: Removes unnecessary debug logging statements

Changes

Core Improvements

1. Unified Pipeline Validation (layer_allocation.py)

  • Before: Two separate methods has_full_pipeline() and has_full_active_pipeline() with different implementations
  • After: Single has_full_pipeline(active_only: bool = False) method that handles both cases
  • Algorithm improvement: Replaced simple layer counting with DFS-based validation that correctly checks for contiguous node chains from layer 0 to num_total_layers

The new implementation:

  • Builds an index of nodes by start_layer
  • Uses DFS to verify a continuous path exists from layer 0 to the end
  • Properly filters inactive nodes when active_only=True
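
A minimal sketch of what such a DFS-based check might look like, assuming each node exposes `start_layer`, `end_layer` (exclusive), and `is_active` (the attribute names are illustrative; the actual fields in layer_allocation.py may differ):

```python
from dataclasses import dataclass


@dataclass
class Node:
    start_layer: int
    end_layer: int  # exclusive
    is_active: bool = True


def has_full_pipeline(nodes, num_total_layers, active_only=False):
    """Check whether some contiguous chain of nodes covers [0, num_total_layers)."""
    # Index candidate nodes by their start layer.
    by_start = {}
    for node in nodes:
        if active_only and not node.is_active:
            continue
        by_start.setdefault(node.start_layer, []).append(node)

    # DFS from layer 0, following end_layer -> start_layer links.
    stack, seen = [0], set()
    while stack:
        layer = stack.pop()
        if layer == num_total_layers:
            return True
        if layer in seen:
            continue
        seen.add(layer)
        for node in by_start.get(layer, []):
            stack.append(node.end_layer)
    return False
```

Unlike layer counting, this accepts overlapping or redundant nodes as long as one contiguous chain reaches the final layer, and rejects layouts whose total layer count matches but leaves a gap.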

2. Node State Management (layer_allocation.py)

  • Added node.is_active = False in deallocate() to ensure nodes are marked inactive when deallocated
  • This prevents inactive nodes from being considered in pipeline validation
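
The state fix can be illustrated with a hypothetical slice of the allocator (class and method shapes are assumptions, not the PR's exact code):

```python
class Node:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.is_active = False


class LayerAllocator:
    """Illustrative sketch of the allocate/deallocate lifecycle."""

    def __init__(self):
        self.nodes: dict[str, Node] = {}

    def allocate(self, node: Node) -> None:
        node.is_active = True
        self.nodes[node.node_id] = node

    def deallocate(self, node_id: str) -> None:
        node = self.nodes.pop(node_id, None)
        if node is not None:
            # The fix: mark the node inactive so that a later
            # has_full_pipeline(active_only=True) check no longer counts it,
            # even if a stale reference to the node is still held elsewhere.
            node.is_active = False
```
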

3. Enhanced Error Handling (scheduler.py)

  • Added proper success/failure logging for global rebalancing operations
  • Ensures _bootstrapped flag and _bootstrapped_event are correctly updated after rebalancing
  • Provides better visibility into scheduler state transitions
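
A sketch of the rebalancing path in the scheduler, assuming an `asyncio.Event` for bootstrap signalling (the method and attribute names besides `_bootstrapped` and `_bootstrapped_event` are illustrative):

```python
import asyncio
import logging

logger = logging.getLogger("scheduler")


class Scheduler:
    def __init__(self, allocator):
        self.allocator = allocator
        self._bootstrapped = False
        self._bootstrapped_event = asyncio.Event()

    def _global_rebalance(self) -> bool:
        success = self.allocator.global_rebalance()
        if success:
            logger.info("Global rebalancing succeeded; full pipeline restored.")
            self._bootstrapped = True
            self._bootstrapped_event.set()
        else:
            logger.warning("Global rebalancing failed; waiting for more nodes.")
            self._bootstrapped = False
            self._bootstrapped_event.clear()
        return success
```

Keeping the flag and the event in sync in both branches is the point of the change: callers blocked on `_bootstrapped_event.wait()` only wake up when a full pipeline actually exists.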

4. API Updates (scheduler_manage.py)

  • Updated get_schedule_status() to use the new unified has_full_pipeline(active_only=True) method
  • Maintains the same functionality while using the improved implementation
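
The call-site change amounts to something like the following (the status payload shown is illustrative; the real endpoint likely returns more fields):

```python
def get_schedule_status(allocator) -> dict:
    """Report whether the currently active nodes form a complete pipeline."""
    return {
        # Previously: allocator.has_full_active_pipeline()
        "has_full_active_pipeline": allocator.has_full_pipeline(active_only=True),
    }
```
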

Code Cleanup

  • Removed debug logging statements from:
    • model_info.py: Removed verbose logging in decoder_layer_io_bytes()
    • node.py: Removed debug logging in get_decoder_layer_capacity()

Benefits

  1. Correctness: The DFS-based pipeline validation correctly identifies contiguous pipelines, fixing potential edge cases with the previous layer-counting approach
  2. Maintainability: Unified API reduces code duplication and makes the codebase easier to maintain
  3. Reliability: Proper node state management ensures inactive nodes don't interfere with pipeline validation
  4. Observability: Better logging helps diagnose scheduler state issues

Testing

  • Existing tests should continue to pass with the improved pipeline validation logic
  • The DFS-based approach is more robust and handles edge cases better than the previous implementation

Migration Notes

  • No breaking changes for external APIs
  • Internal callers of has_full_active_pipeline() should use has_full_pipeline(active_only=True) instead
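
If a transitional shim is wanted for internal callers that have not yet migrated, it could look like this (not part of the PR; the deprecated wrapper and the placeholder body are assumptions):

```python
import warnings


class LayerAllocator:
    def has_full_pipeline(self, active_only: bool = False) -> bool:
        # Stand-in for the real DFS-based validation in layer_allocation.py.
        return True

    def has_full_active_pipeline(self) -> bool:
        """Deprecated: use has_full_pipeline(active_only=True) instead."""
        warnings.warn(
            "has_full_active_pipeline() is deprecated; "
            "use has_full_pipeline(active_only=True)",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.has_full_pipeline(active_only=True)
```
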

@christ-tt christ-tt left a comment

LGTM after minor fix. Good work and thanks a lot!

@JasonOE JasonOE requested a review from a team November 6, 2025 06:16
@christ-tt

I think we can merge this for now.
As a TODO, let's consider this case: suppose we have 3 identical nodes, our model has 28 layers, each node can host 22 layers. After bootstrapping, we will have node 1: [0, 14), node 2: [14, 28), and node 3, currently idling, hosting [0, 22). Now, suppose node 1 left, we now have node 2 and node 3; here we don't necessarily need global rebalance - node 2 can stay unchanged, and we only need to trim the layer hosted on node 3 ([0, 22) -> [0, 14)).
We have very similar logic implemented in request routing phase. I think we can try to implement a separate function here in layer_allocation, checking if trimming any nodes makes sense, and put this before should_global_rebalance check. Especially for active but idling (not forming any actual pipeline, or serving any request) nodes.
But for now, I think current implementation is good enough. We can open another PR handling this optimization.
Thanks.
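
One way the trim check suggested in the comment above might look; the function name and the restriction to a gap at the front of the chain are assumptions, matching only the 3-node scenario described (a real helper would need to handle gaps anywhere in the chain):

```python
from dataclasses import dataclass


@dataclass
class Node:
    start_layer: int
    end_layer: int  # exclusive


def find_trim_candidate(serving, idle, num_total_layers):
    """Look for an idle node that can be trimmed to fill the gap in front
    of the serving chain, avoiding a global rebalance.

    Assumes the serving nodes already form a contiguous chain ending at
    num_total_layers.
    """
    if not serving:
        return None
    # The gap to fill runs from layer 0 up to the first serving node.
    gap_end = min(n.start_layer for n in serving)
    if gap_end == 0:
        return None  # no gap: the pipeline is already complete
    for node in idle:
        # Trimming works if the idle node already hosts [0, gap_end).
        if node.start_layer == 0 and node.end_layer >= gap_end:
            return node, (0, gap_end)
    return None
```

In the example above, with node 2 serving [14, 28) and node 3 idle on [0, 22), this would pick node 3 and trim it to [0, 14).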

@JasonOE JasonOE merged commit 6c4560f into main Nov 7, 2025
4 checks passed
@JasonOE JasonOE deleted the mini-pr branch November 7, 2025 08:11