Conversation

@JasonOE (Collaborator) commented Oct 29, 2025

📋 PR issues 79 Node rebalance

#79

@JasonOE JasonOE requested review from gufengc and sl-gn October 29, 2025 12:13
self.scheduler.enqueue_leave(node.node_id)
# Check if this is a rebalance-triggered leave (to avoid cascading rebalances)
logger.info(f"Node {node.node_id} leaving (is_rebalance_leave={is_rebalance_leave})")
self.scheduler.enqueue_leave(node.node_id, is_rebalance_leave=is_rebalance_leave)
Collaborator:
If it is a rebalance leave, can we just ignore it on the client side, so the client doesn't send it to the scheduler at all?
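A minimal sketch of what the reviewer suggests: short-circuit rebalance-triggered leaves on the client so the scheduler is never notified and cannot cascade into another rebalance. Only `enqueue_leave` and `is_rebalance_leave` come from the diff; the `NodeClient` class and `on_node_leave` method are hypothetical names for illustration.

```python
class NodeClient:
    """Hypothetical client-side wrapper; only enqueue_leave and the
    is_rebalance_leave flag are taken from the PR diff."""

    def __init__(self, scheduler):
        self.scheduler = scheduler

    def on_node_leave(self, node_id: str, is_rebalance_leave: bool = False) -> bool:
        """Report a leave to the scheduler; return True if it was sent."""
        if is_rebalance_leave:
            # Rebalance-triggered exits are expected, so skip the scheduler
            # round-trip entirely; nothing can cascade into a new rebalance.
            return False
        self.scheduler.enqueue_leave(node_id)
        return True
```

This keeps the anti-cascading decision in one place on the client instead of threading the flag through the scheduler API.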

cmd.txt Outdated
@@ -0,0 +1,9 @@
cd code/gradient/parallax && source .venv/bin/activate && export HF_ENDPOINT=https://hf-mirror.com
Collaborator:
Remove it.

Collaborator Author:
OK


@app.get("/model/list")
async def model_list():
model_list_names = get_model_list()
Collaborator:
Could you confirm the UI change with rymon?

Collaborator Author:
ok, rymon

@gufengc (Collaborator) commented Oct 30, 2025

Your change is full of debug logging and chaotic code structure, mixed with AI-generated code; please refactor it.
And split it into smaller PRs if possible.

else:
self.kv_cache_manager.release_request(req.request_id)

logger.info("✅ Executor run_loop exited cleanly (stop flag set)")
Collaborator:
We don't want to include emojis in our log messages.

Collaborator Author:

OK

self._bootstrapped_event: threading.Event = threading.Event()
# Track if rebalance is needed (nodes should restart)
self._rebalance_restart_needed: bool = False
self._rebalance_reason: str = ""
Collaborator:
Do we need to store this string? Can you simply log the reason instead?
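A sketch of the reviewer's suggestion: keep only the boolean flag and emit the reason as a log line at the moment it is known, instead of carrying `_rebalance_reason` as instance state. The `mark_rebalance_needed` method name is an assumption; only `_rebalance_restart_needed` and `_rebalance_reason` appear in the diff.

```python
import logging

logger = logging.getLogger(__name__)


class Scheduler:
    """Sketch only: drop the _rebalance_reason string from state."""

    def __init__(self):
        self._rebalance_restart_needed: bool = False

    def mark_rebalance_needed(self, reason: str) -> None:
        self._rebalance_restart_needed = True
        # The reason is purely human-readable context, so logging it once
        # here is enough; no later code needs to read it back.
        logger.info("Rebalance restart needed: %s", reason)
```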

self._node_count_cv.notify_all()
return

# Check if we need to trigger global rebalance
Collaborator:
For better modularity, move this logic into a function, maybe called should_rebalance, and rename the existing should_global_rebalance function to something like load_balance_check.
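One possible shape for that refactor: a single `should_rebalance` entry point that composes the pipeline-coverage check with the renamed load-balance check. `has_full_pipeline` and the reason string come from the diff; the `load_balance_check` body is a placeholder and the class wiring is assumed.

```python
class Scheduler:
    """Sketch of the suggested modularization; allocator interface assumed."""

    def __init__(self, layer_allocator):
        self.layer_allocator = layer_allocator

    def load_balance_check(self) -> bool:
        """Formerly should_global_rebalance: the purely load-based check.

        Placeholder body; the real logic already exists in the codebase.
        """
        return False

    def should_rebalance(self, node_id: str) -> tuple[bool, str]:
        """Decide, in one place, whether a node leave requires a rebalance."""
        # Case 1: pipeline coverage is broken by the departing node.
        if not self.layer_allocator.has_full_pipeline():
            return True, f"Node {node_id} left, no complete pipeline coverage"
        # Case 2: coverage is fine but load is skewed.
        if self.load_balance_check():
            return True, "load imbalance"
        return False, ""
```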


# Skip rebalance trigger if this leave is part of a rebalance restart
# (to avoid cascading rebalances when nodes exit to restart)
if is_rebalance_leave:
@christ-tt (Collaborator) commented Oct 30, 2025
Don't branch here; otherwise node_count_cv.notify_all(), which already runs at the end of this function, is repeated.
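A sketch of the single-exit structure the reviewer is pointing at: gate only the rebalance decision on `is_rebalance_leave` and fall through to one `notify_all` at the end, rather than branching into a path that notifies again. `_node_count_cv`, `enqueue_leave`, and `is_rebalance_leave` come from the diff; `_maybe_trigger_rebalance` and the `rebalance_checks` list are illustrative placeholders.

```python
import threading


class Scheduler:
    """Sketch only: one notify_all, no duplicated exit path."""

    def __init__(self):
        self._node_count_cv = threading.Condition()
        self.rebalance_checks = []  # recorded purely for illustration

    def enqueue_leave(self, node_id: str, is_rebalance_leave: bool = False) -> None:
        with self._node_count_cv:
            # Only organic leaves may trigger a rebalance; rebalance-triggered
            # exits must not cascade into another one.
            if not is_rebalance_leave:
                self._maybe_trigger_rebalance(node_id)
            # Single exit point: waiters are notified exactly once per call.
            self._node_count_cv.notify_all()

    def _maybe_trigger_rebalance(self, node_id: str) -> None:
        # Placeholder for the existing rebalance decision.
        self.rebalance_checks.append(node_id)
```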

return False
return True

def has_contiguous_pipeline(self) -> bool:
@christ-tt (Collaborator) commented Oct 30, 2025
We can remove has_full_active_pipeline and merge this function into has_full_pipeline, adding a function argument for the contiguous check. Also, we don't need to trigger global_rebalance in this case: Node 2 can stay untouched. Our principle is to avoid global shutdown/restart, so here we should change Node 1 only. We already have similar logic implemented; check this function.
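A sketch of the merge being suggested: fold the contiguity check into `has_full_pipeline` behind a flag. Only the function name comes from the diff; modeling node assignments as sorted `(start_layer, end_layer)` half-open intervals is an assumption about the allocator's data.

```python
class LayerAllocator:
    """Sketch: one has_full_pipeline with an optional contiguous check.

    Assumes each node's assignment is a (start, end) half-open layer range.
    """

    def __init__(self, assignments, num_layers):
        self.assignments = assignments  # list of (start, end) per node
        self.num_layers = num_layers

    def has_full_pipeline(self, contiguous: bool = False) -> bool:
        spans = sorted(self.assignments)
        cursor = 0  # first layer not yet known to be covered
        for start, end in spans:
            if start > cursor:
                return False  # gap: layers [cursor, start) are uncovered
            if contiguous and start != cursor:
                return False  # overlap: contiguous mode forbids double coverage
            cursor = max(cursor, end)
        return cursor >= self.num_layers
```

With `contiguous=False` this accepts overlapping coverage; with `contiguous=True` it demands an exact tiling of the layer range, which is the case the removed has_contiguous_pipeline handled.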

if not self.layer_allocator.has_full_pipeline():
needs_rebalance = True
rebalance_reason = f"Node {node_id} left, no complete pipeline coverage"
# Case 2: Has full coverage but not contiguous (has overlaps or gaps)
Collaborator:
Check my previous comment. We don't need global rebalance here.

@JasonOE JasonOE closed this Nov 7, 2025

4 participants