-
Notifications
You must be signed in to change notification settings - Fork 49
feat(scheduler): scheduler change after node leave #194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
christ-tt
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM after minor fix. Good work and thanks a lot!
|
I think we can merge this for now. |
Refactor Pipeline Validation and Improve Node State Management
Summary
This PR refactors the pipeline validation logic, improves node state management, and enhances error handling in the scheduler. The main changes include:
has_full_active_pipeline()intohas_full_pipeline()with anactive_onlyparameteris_activeflag is properly set when nodes are deallocatedChanges
Core Improvements
1. Unified Pipeline Validation (
layer_allocation.py)has_full_pipeline()andhas_full_active_pipeline()with different implementationshas_full_pipeline(active_only: bool = False)method that handles both casesnum_total_layersThe new implementation:
start_layeractive_only=True2. Node State Management (
layer_allocation.py)node.is_active = Falseindeallocate()to ensure nodes are marked inactive when deallocated3. Enhanced Error Handling (
scheduler.py)_bootstrappedflag and_bootstrapped_eventare correctly updated after rebalancing4. API Updates (
scheduler_manage.py)get_schedule_status()to use the new unifiedhas_full_pipeline(active_only=True)methodCode Cleanup
model_info.py: Removed verbose logging indecoder_layer_io_bytes()node.py: Removed debug logging inget_decoder_layer_capacity()Benefits
Testing
Migration Notes
has_full_active_pipeline()should usehas_full_pipeline(active_only=True)instead