Feature/rebalance #172
Conversation
self.scheduler.enqueue_leave(node.node_id)
# Check if this is a rebalance-triggered leave (to avoid cascading rebalances)
logger.info(f"Node {node.node_id} leaving (is_rebalance_leave={is_rebalance_leave})")
self.scheduler.enqueue_leave(node.node_id, is_rebalance_leave=is_rebalance_leave)
If it is a rebalance leave, can we just ignore it on the client side? The client shouldn't send it to the scheduler at all.
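The suggestion above could look like the following client-side sketch. The class, method, and attribute names here are illustrative assumptions, not code from this PR:

```python
class NodeClient:
    """Hypothetical client-side handler; names are illustrative only."""

    def __init__(self, scheduler):
        self.scheduler = scheduler

    def handle_leave(self, node_id: str, is_rebalance_leave: bool) -> bool:
        # If the leave was triggered by a rebalance, drop it locally
        # instead of forwarding it, so the scheduler never sees
        # rebalance-triggered leaves and cannot cascade rebalances.
        if is_rebalance_leave:
            return False  # not forwarded
        self.scheduler.enqueue_leave(node_id)
        return True  # forwarded
```

With this shape, the scheduler API would not need the extra `is_rebalance_leave` parameter at all.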
cmd.txt (Outdated)

@@ -0,0 +1,9 @@
cd code/gradient/parallax && source .venv/bin/activate && export HF_ENDPOINT=https://hf-mirror.com
remove it
OK
@app.get("/model/list")
async def model_list():
    model_list_names = get_model_list()
Could you confirm the UI change with rymon?
OK, will confirm with rymon.
Your change is full of debug logs and has a chaotic code structure, mixed with AI-generated code. Please refactor it.
else:
    self.kv_cache_manager.release_request(req.request_id)
logger.info("✅ Executor run_loop exited cleanly (stop flag set)")
We don't want to include emoji in our log output.
OK
self._bootstrapped_event: threading.Event = threading.Event()
# Track if rebalance is needed (nodes should restart)
self._rebalance_restart_needed: bool = False
self._rebalance_reason: str = ""
Do we need this string? Can you simply log the reason instead of storing it?
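A minimal sketch of what the reviewer is suggesting: keep only the boolean flag and emit the reason to the log at the point the decision is made, rather than carrying it in instance state. The class and method names below are assumptions for illustration:

```python
import logging

logger = logging.getLogger(__name__)


class Scheduler:
    """Illustrative sketch: the reason string lives in the log,
    not in an instance attribute."""

    def __init__(self):
        self._rebalance_restart_needed = False

    def mark_rebalance_needed(self, reason: str) -> None:
        self._rebalance_restart_needed = True
        # Log the reason immediately instead of storing _rebalance_reason.
        logger.info("Rebalance restart needed: %s", reason)
```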
self._node_count_cv.notify_all()
return

# Check if we need to trigger global rebalance
For better modularity, put this logic into a function, maybe called should_rebalance, and rename the previous should_global_rebalance function to something like load_balance_check.
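The proposed split might look like the sketch below. The function names follow the reviewer's suggestion, but the internals (what each check actually consults) are assumptions:

```python
class Scheduler:
    """Sketch of the suggested refactor; allocator methods are assumed."""

    def __init__(self, layer_allocator):
        self.layer_allocator = layer_allocator

    def load_balance_check(self) -> bool:
        # Renamed from should_global_rebalance: now purely a load check.
        return self.layer_allocator.is_imbalanced()

    def should_rebalance(self) -> bool:
        # The inline "check if we need to trigger global rebalance"
        # logic, extracted into one place for modularity.
        if not self.layer_allocator.has_full_pipeline():
            return True
        return self.load_balance_check()
```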
# Skip rebalance trigger if this leave is part of a rebalance restart
# (to avoid cascading rebalances when nodes exit to restart)
if is_rebalance_leave:
Don't branch here; otherwise node_count_cv.notify_all() ends up repeated at the end of this function.
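One way to read this suggestion: gate only the rebalance trigger instead of returning early, so the shared tail (notify_all) is written exactly once. A sketch, with method names and the trigger placeholder assumed:

```python
import threading


class Scheduler:
    """Sketch: single exit path so notify_all appears only once."""

    def __init__(self):
        self._node_count_cv = threading.Condition()
        self.rebalance_triggered = False

    def on_leave(self, node_id: str, is_rebalance_leave: bool) -> None:
        with self._node_count_cv:
            # Only the rebalance trigger is conditional; every path
            # still falls through to the one notify_all() at the tail.
            if not is_rebalance_leave:
                self._maybe_trigger_rebalance(node_id)
            self._node_count_cv.notify_all()

    def _maybe_trigger_rebalance(self, node_id: str) -> None:
        # Placeholder for the real rebalance check in the PR.
        self.rebalance_triggered = True
```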
return False
return True

def has_contiguous_pipeline(self) -> bool:
We can remove has_full_active_pipeline and merge this function into has_full_pipeline, adding a function argument for the contiguous check. Also, we don't need to trigger global_rebalance in this case: Node 2 can stay untouched. Our principle is to avoid a global shutdown/restart, so here we should change Node 1 only. We already have similar logic implemented; check this function.
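The merge suggested above might look like this sketch. The interval representation (each node holding a half-open [start, end) layer range) is an assumption, not taken from this PR:

```python
class LayerAllocator:
    """Sketch of has_full_pipeline with an optional contiguity check."""

    def __init__(self, num_layers: int, assignments: dict):
        self.num_layers = num_layers
        self.assignments = assignments  # node_id -> (start, end)

    def has_full_pipeline(self, check_contiguous: bool = False) -> bool:
        intervals = sorted(self.assignments.values())
        if not intervals:
            return False
        covered_end = 0
        for start, end in intervals:
            if start > covered_end:
                return False  # gap in layer coverage
            if check_contiguous and start != covered_end:
                return False  # overlap: not allowed in contiguous mode
            covered_end = max(covered_end, end)
        return covered_end >= self.num_layers
```

With the flag defaulting to False, existing callers of has_full_pipeline keep their behavior, and the old has_contiguous_pipeline becomes has_full_pipeline(check_contiguous=True).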
if not self.layer_allocator.has_full_pipeline():
    needs_rebalance = True
    rebalance_reason = f"Node {node_id} left, no complete pipeline coverage"
# Case 2: Has full coverage but not contiguous (has overlaps or gaps)
Check my previous comment. We don't need global rebalance here.
Related issue: #79 Node rebalance