@dkhalanskyjb
Collaborator

@dkhalanskyjb dkhalanskyjb commented Sep 18, 2025

Fix authored by @vsalavatov.

The race could lead to CPU tasks not being executed even when CPU cores are available.

Fixes #4491

@qwwdfsad
Member

qwwdfsad commented Nov 6, 2025

Note during the review: quite similar to #3660

Member

@qwwdfsad qwwdfsad left a comment

I think we shouldn't merge this one (apart from the test, which should be marked as @Ignore).

#4491 and #3660 are issues of the same equivalence class, and the proposed fix addresses only one very specific, pinned-down instance.

Consider the same pattern, but manifested in a slightly different manner:

  • T_CPU holds the only CPU permit, scans the tasks, doesn't find anything, places itself on a stack
  • T_CPU scans again, doesn't find anything again, a thread switch happens at tryPark()
  • T_B (or several workers in BLOCKING mode) also put themselves on the stack, on top of T_CPU
    // NB: the diff is here
  • T_B', another blocking worker, schedules a CPU task (dispatched to its local queue), wakes up T_B
    // End of the diff
  • T_B can't acquire a CPU permit, scans blocking queue(s), doesn't find anything, parks
  • T_CPU releases the CPU permit, parks

There are tasks in the CPU queue, but all workers are parked, so the scheduler won't make progress until there is another dispatch.

The problem is basically the same, but the fix does not address it.
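
The interleaving above can be replayed as a deterministic, single-threaded toy model. This is only a sketch to make the end state explicit, not the scheduler's actual code: `ToyScheduler`, its fields, and `replay_interleaving` are hypothetical names, the per-worker local queue is collapsed into one `cpu_queue`, and T_B' is assumed to stay occupied with its blocking work after dispatching.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class ToyScheduler:
    """Hypothetical single-threaded model of the interleaving; not the real scheduler."""
    workers: Tuple[str, ...]
    cpu_permit_holder: Optional[str] = None                 # one CPU permit in total
    parked_stack: List[str] = field(default_factory=list)   # last element is the top
    asleep: Set[str] = field(default_factory=set)           # workers that actually parked
    busy_blocking: Set[str] = field(default_factory=set)    # workers stuck in blocking work
    cpu_queue: List[str] = field(default_factory=list)
    blocking_queue: List[str] = field(default_factory=list)

    def push_on_stack(self, worker: str) -> None:
        self.parked_stack.append(worker)

    def park(self, worker: str) -> None:
        self.asleep.add(worker)

    def wake_top(self) -> Optional[str]:
        """Pop and wake whoever is on top of the stack, if anyone."""
        if not self.parked_stack:
            return None
        worker = self.parked_stack.pop()
        self.asleep.discard(worker)
        return worker

    def try_acquire_cpu(self, worker: str) -> bool:
        if self.cpu_permit_holder is None:
            self.cpu_permit_holder = worker
            return True
        return False

    def release_cpu(self, worker: str) -> None:
        if self.cpu_permit_holder == worker:
            self.cpu_permit_holder = None

    def is_stalled(self) -> bool:
        """CPU work is pending, the permit is free, and nobody is left to pick it up."""
        nobody_available = all(
            w in self.asleep or w in self.busy_blocking for w in self.workers
        )
        return bool(self.cpu_queue) and self.cpu_permit_holder is None and nobody_available


def replay_interleaving() -> ToyScheduler:
    s = ToyScheduler(workers=("T_CPU", "T_B", "T_B'"))
    s.busy_blocking.add("T_B'")      # assumption: T_B' stays in its blocking work
    s.try_acquire_cpu("T_CPU")       # T_CPU holds the only CPU permit

    # T_CPU scans, finds nothing, places itself on the stack.
    s.push_on_stack("T_CPU")
    # T_CPU scans again, finds nothing, and is switched out at tryPark():
    # it still holds the permit and is not asleep yet.

    # T_B puts itself on the stack, on top of T_CPU, and parks.
    s.push_on_stack("T_B")
    s.park("T_B")

    # T_B' schedules a CPU task and wakes the top of the stack, which is T_B.
    s.cpu_queue.append("cpu-task")
    woken = s.wake_top()

    # T_B cannot acquire the CPU permit, finds no blocking work, parks again.
    if not s.try_acquire_cpu(woken) and not s.blocking_queue:
        s.push_on_stack(woken)
        s.park(woken)

    # T_CPU resumes after the switch: releases the permit and parks.
    s.release_cpu("T_CPU")
    s.park("T_CPU")
    return s


state = replay_interleaving()
print(state.is_stalled())  # True
```

In this model the final state has a pending CPU task, a free CPU permit, and both T_CPU and T_B asleep. A further dispatch that calls `wake_top()` would wake T_B, which could then acquire the permit, matching the "until there is another dispatch" condition.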

The root cause is not a missing check but rather a systemic flaw -- communication between threads (e.g. stack-based parking/unparking) and handover of the CPU permits are not synchronized, which leads to a whole plethora of "park and act" races.
