Conversation

@JasonOE (Collaborator) commented Oct 29, 2025

📋 PR issues 79 Node rebalance

#79

@JasonOE JasonOE requested review from gufengc and sl-gn October 29, 2025 12:13
self.scheduler.enqueue_leave(node.node_id)
# Check if this is a rebalance-triggered leave (to avoid cascading rebalances)
logger.info(f"Node {node.node_id} leaving (is_rebalance_leave={is_rebalance_leave})")
self.scheduler.enqueue_leave(node.node_id, is_rebalance_leave=is_rebalance_leave)
Collaborator:
If it is a rebalance leave, can we just ignore it on the client side, so the client doesn't send it to the scheduler at all?
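A minimal sketch of what the reviewer suggests: short-circuit rebalance-triggered leaves on the client so the scheduler is never notified and cannot cascade into another rebalance. Only `enqueue_leave` and `is_rebalance_leave` come from the diff; the `NodeClient` class and `on_node_leave` method are hypothetical names for illustration.

```python
class NodeClient:
    """Hypothetical client-side wrapper; only enqueue_leave and the
    is_rebalance_leave flag are taken from the PR diff."""

    def __init__(self, scheduler):
        self.scheduler = scheduler

    def on_node_leave(self, node_id: str, is_rebalance_leave: bool = False) -> bool:
        """Report a leave to the scheduler; return True if it was sent."""
        if is_rebalance_leave:
            # Rebalance-triggered exits are expected, so skip the scheduler
            # round-trip entirely; nothing can cascade into a new rebalance.
            return False
        self.scheduler.enqueue_leave(node_id)
        return True
```

This keeps the anti-cascading decision in one place on the client instead of threading the flag through the scheduler API.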

cmd.txt Outdated
@@ -0,0 +1,9 @@
cd code/gradient/parallax && source .venv/bin/activate && export HF_ENDPOINT=https://hf-mirror.com
Collaborator:
Remove it.

Collaborator Author:
OK


@app.get("/model/list")
async def model_list():
model_list_names = get_model_list()
Collaborator:
Could you confirm the UI change with rymon?

Collaborator Author:
ok, rymon

@gufengc (Collaborator) commented Oct 30, 2025

Your change is full of debug logging and chaotic code structure, mixed with AI-generated code; please refactor it.
And split it into smaller PRs if possible.

else:
self.kv_cache_manager.release_request(req.request_id)

logger.info("✅ Executor run_loop exited cleanly (stop flag set)")
Collaborator:
We don't want to include emojis in our log messages.

Collaborator Author:

OK

self._bootstrapped_event: threading.Event = threading.Event()
# Track if rebalance is needed (nodes should restart)
self._rebalance_restart_needed: bool = False
self._rebalance_reason: str = ""
Collaborator:
Do we need to store this string? Can you simply log the reason instead?
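A sketch of the reviewer's suggestion: keep only the boolean flag and emit the reason as a log line at the moment it is known, instead of carrying `_rebalance_reason` as instance state. The `mark_rebalance_needed` method name is an assumption; only `_rebalance_restart_needed` and `_rebalance_reason` appear in the diff.

```python
import logging

logger = logging.getLogger(__name__)


class Scheduler:
    """Sketch only: drop the _rebalance_reason string from state."""

    def __init__(self):
        self._rebalance_restart_needed: bool = False

    def mark_rebalance_needed(self, reason: str) -> None:
        self._rebalance_restart_needed = True
        # The reason is purely human-readable context, so logging it once
        # here is enough; no later code needs to read it back.
        logger.info("Rebalance restart needed: %s", reason)
```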

self._node_count_cv.notify_all()
return

# Check if we need to trigger global rebalance
Collaborator:
For better modularity, move this logic into a function, maybe called should_rebalance, and rename the existing should_global_rebalance function to something like load_balance_check.
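One possible shape for that refactor: a single `should_rebalance` entry point that composes the pipeline-coverage check with the renamed load-balance check. `has_full_pipeline` and the reason string come from the diff; the `load_balance_check` body is a placeholder and the class wiring is assumed.

```python
class Scheduler:
    """Sketch of the suggested modularization; allocator interface assumed."""

    def __init__(self, layer_allocator):
        self.layer_allocator = layer_allocator

    def load_balance_check(self) -> bool:
        """Formerly should_global_rebalance: the purely load-based check.

        Placeholder body; the real logic already exists in the codebase.
        """
        return False

    def should_rebalance(self, node_id: str) -> tuple[bool, str]:
        """Decide, in one place, whether a node leave requires a rebalance."""
        # Case 1: pipeline coverage is broken by the departing node.
        if not self.layer_allocator.has_full_pipeline():
            return True, f"Node {node_id} left, no complete pipeline coverage"
        # Case 2: coverage is fine but load is skewed.
        if self.load_balance_check():
            return True, "load imbalance"
        return False, ""
```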


# Skip rebalance trigger if this leave is part of a rebalance restart
# (to avoid cascading rebalances when nodes exit to restart)
if is_rebalance_leave:
@christ-tt (Collaborator) commented Oct 30, 2025
Don't branch here; otherwise node_count_cv.notify_all(), which already runs at the end of this function, is repeated.
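A sketch of the single-exit structure the reviewer is pointing at: gate only the rebalance decision on `is_rebalance_leave` and fall through to one `notify_all` at the end, rather than branching into a path that notifies again. `_node_count_cv`, `enqueue_leave`, and `is_rebalance_leave` come from the diff; `_maybe_trigger_rebalance` and the `rebalance_checks` list are illustrative placeholders.

```python
import threading


class Scheduler:
    """Sketch only: one notify_all, no duplicated exit path."""

    def __init__(self):
        self._node_count_cv = threading.Condition()
        self.rebalance_checks = []  # recorded purely for illustration

    def enqueue_leave(self, node_id: str, is_rebalance_leave: bool = False) -> None:
        with self._node_count_cv:
            # Only organic leaves may trigger a rebalance; rebalance-triggered
            # exits must not cascade into another one.
            if not is_rebalance_leave:
                self._maybe_trigger_rebalance(node_id)
            # Single exit point: waiters are notified exactly once per call.
            self._node_count_cv.notify_all()

    def _maybe_trigger_rebalance(self, node_id: str) -> None:
        # Placeholder for the existing rebalance decision.
        self.rebalance_checks.append(node_id)
```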

return False
return True

def has_contiguous_pipeline(self) -> bool:
@christ-tt (Collaborator) commented Oct 30, 2025
We can remove has_full_active_pipeline and merge this function into has_full_pipeline, adding a function argument for the contiguous check. Also, we don't need to trigger global_rebalance in this case: Node 2 can stay untouched. Our principle is to avoid global shutdown/restart, so here we should change Node 1 only. We already have similar logic implemented; check this function.
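A sketch of the merge being suggested: fold the contiguity check into `has_full_pipeline` behind a flag. Only the function name comes from the diff; modeling node assignments as sorted `(start_layer, end_layer)` half-open intervals is an assumption about the allocator's data.

```python
class LayerAllocator:
    """Sketch: one has_full_pipeline with an optional contiguous check.

    Assumes each node's assignment is a (start, end) half-open layer range.
    """

    def __init__(self, assignments, num_layers):
        self.assignments = assignments  # list of (start, end) per node
        self.num_layers = num_layers

    def has_full_pipeline(self, contiguous: bool = False) -> bool:
        spans = sorted(self.assignments)
        cursor = 0  # first layer not yet known to be covered
        for start, end in spans:
            if start > cursor:
                return False  # gap: layers [cursor, start) are uncovered
            if contiguous and start != cursor:
                return False  # overlap: contiguous mode forbids double coverage
            cursor = max(cursor, end)
        return cursor >= self.num_layers
```

With `contiguous=False` this accepts overlapping coverage; with `contiguous=True` it demands an exact tiling of the layer range, which is the case the removed has_contiguous_pipeline handled.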

if not self.layer_allocator.has_full_pipeline():
needs_rebalance = True
rebalance_reason = f"Node {node_id} left, no complete pipeline coverage"
# Case 2: Has full coverage but not contiguous (has overlaps or gaps)
Collaborator:
Check my previous comment. We don't need global rebalance here.

@JasonOE JasonOE closed this Nov 7, 2025

4 participants