Skip to content

Conversation

@sanity
Copy link
Collaborator

@sanity sanity commented Nov 7, 2025

Why

  • ConnectV1’s handshake/routing path routinely wedged gateways and dropped multi-hop traffic; River’s larger integration tests reproduced this every run.
  • Migrating to ConnectV2 touched every stage of join/maintenance, so the previous PR description was no longer accurate and reviewers lacked a reliable summary of the new surface area.
  • The River harness itself needed upgrades (stable ports, better logging, retries, diagnostics) to prove the network fixes actually stick.

What changed

ConnectV2 migration

  • Introduced ConnectV2 op scaffolding, message types, and joiner/relay helpers, and routed them through the op manager.
  • Taught maintenance/join flows to invoke the new ConnectV2 paths, then removed/ignored the leftover V1 message handlers so we only have one handshake implementation.
  • Deduplicated concurrent outbound connection attempts and cleaned up courtesy connections so gateways keep forwarding JOINs while still accepting new peers.

Routing, subscription, and update stability

  • Ensure subscription acknowledgements survive races between the WebAPI client and LocalSubscribeComplete, and allow updates/GET fallbacks to walk past local minima.
  • Clamp HTL and reuse fallback seeker logic so UPDATE propagation doesn’t stall after the first hop.
  • Fixes around connection_manager.add_connection, maintenance triggers, and warning cleanup to keep topology state accurate.

River integration harness

  • Reserve unique port blocks per peer, keep nodes alive on handshake drops, and retry the flaky three-hop PUT path so failures are real network issues, not harness flukes.
  • Default the CLI/tests to prettier logging unless JSON is requested, bump freenet-stdlib so contracts match the CLI, and preserve per-run artifacts under ~/code/tmp/freenet-test-networks/ with run_status.txt summaries.

Testing

  • cargo check -p freenet
  • cargo test --test operations
  • cargo test --test test_macro_example
  • FREENET_CORE_PATH=~/code/freenet/freenet-core/fix-handshake-dedupe cargo test --test message_flow river_message_flow_over_freenet -- --ignored --exact
  • FREENET_CORE_PATH=~/code/freenet/freenet-core/fix-handshake-dedupe cargo test --test message_flow river_message_flow_over_freenet_six_peers_five_rounds -- --ignored --exact (most recent rerun passes; prior failure logs are preserved under the harness artifact directory for debugging)

sanity added 26 commits November 2, 2025 23:18
Tested: cargo test --test message_flow river_message_flow_over_freenet -- --ignored --exact --nocapture

Tested: cargo test --test message_flow river_message_flow_over_freenet_six_peers_five_rounds -- --ignored --exact --nocapture
@sanity sanity marked this pull request as draft November 7, 2025 00:25
@sanity
Copy link
Collaborator Author

sanity commented Nov 7, 2025

Architecture & Design Review

ConnectV2 Migration: Strong Simplification ✅

The handshake.rs reduction from ~1567 lines to ~224 lines is impressive and suggests a successful architectural simplification. The move from complex handshake orchestration to a lightweight adapter pattern should improve maintainability.

Strengths:

  • Consolidates connection logic into ConnectOp state machine
  • Reduces surface area for race conditions
  • Clearer separation of concerns (transport vs protocol logic)

Questions:

  • How does ConnectV2 handle backward compatibility with v1 peers?
  • What's the migration path for existing network deployments?

Session Actor Race Condition Fix: Critical Improvement ✅

The pending_results mechanism (session_actor.rs:20-41) elegantly solves the LocalSubscribeComplete vs client registration race:

pending_results: HashMap<Transaction, PendingResult>
delivered_clients: HashSet<ClientId>

This deferred delivery pattern ensures results reach late-registering clients, which is essential for the Subscribe operation reliability mentioned in the PR description.

Observation: The session_actor test suite includes test_pending_result_reaches_late_registered_clients and test_pending_result_delivered_after_registration which directly verify this fix.

Code Quality

Positive Patterns

  1. Arc optimization (session_actor.rs:129): Minimizes cloning in 1→N fanout delivery
  2. Comprehensive logging: Good use of tracing throughout for debugging network issues
  3. Test coverage: New tests for race conditions show thoroughness

Areas for Consideration

  1. PR Scope (49 files, ~8k lines changed):

    • Mixes three concerns: ConnectV2 migration, routing fixes, River harness improvements
    • Consider stacked PRs for easier review:
      • PR1: Session actor race condition fix (isolated, high-value)
      • PR2: ConnectV2 scaffolding
      • PR3: River harness stabilization
    • Current monolithic approach makes bisecting regressions harder
  2. Removed Code (handshake/tests.rs deleted entirely):

    • 651 lines of handshake tests removed
    • Are these scenarios now covered by operations.rs or River integration tests?
    • Recommendation: Add comment in commit message explaining test migration
  3. once_cell Dependency (Cargo.toml):

    • New dependency added but usage not visible in the diff snippet
    • What's the motivation? (Could be for lazy statics in connect logic)

Testing & Verification

CI Status: Test job still pending - recommend waiting for full green before final review.

River Integration Tests: The PR mentions successful runs of:

  • river_message_flow_over_freenet
  • river_message_flow_over_freenet_six_peers_five_rounds

This is strong evidence the networking fixes work under realistic multi-hop scenarios.

Manual Testing Recommendations:

  1. Run multi-peer network viz to verify topology stability
  2. Test gateway reconnection with exponential backoff (mentioned in node/mod.rs:1157)
  3. Verify subscription stability under network churn

Risk Assessment

High Impact Areas:

  • Connection handshake is critical path for network formation
  • Subscription reliability affects River chat application
  • Changes touch 49 files across core networking stack

Mitigation:

  • Thorough soak testing on nova's multi-peer setup (~/code/admin/nova/freenet/peer-manager.sh)
  • Network visualization to confirm topology health
  • Monitor for any increased connection failures in production

Detailed Findings

client_events/mod.rs:1155-1189

The Subscribe operation now generates the transaction ID before registering with the session actor, fixing the race where LocalSubscribeComplete could fire before registration. Clean fix.

operations/connect.rs

Massive restructuring (892 additions / 1182 deletions). Would benefit from:

  • High-level comment explaining ConnectV2 state transitions
  • Diagram of the new handshake flow (even ASCII art helps)

operations/update.rs:198 additions

Significant expansion - what's the main change here? HTL clamping mentioned in PR description? Recommend adding doc comments for the new logic.

ring/connection_manager.rs

147 additions suggest substantial refactoring. Key question: How does the new "deduplicated concurrent outbound connection attempts" work? This is mentioned in the PR description but the mechanism isn't clear from the diff snippet.

Recommendations

Before merging:

  1. Wait for full CI green (Test job pending)
  2. Add high-level architecture comment in connect.rs explaining ConnectV2 flow
  3. Confirm handshake test coverage migrated to operations.rs or River tests
  4. Consider documenting the migration path from ConnectV1 (if relevant)

Post-merge:

  1. Soak test on nova multi-peer setup for 24+ hours
  2. Monitor connection success rates
  3. Run network-viz periodically to verify topology health

For future PRs:
Consider breaking up large refactors using stacked PRs per the AGENTS.md guidelines - it enables:

  • Focused review of each concern
  • Easier bisecting if regressions appear
  • Incremental merge of low-risk changes while higher-risk parts get extra scrutiny

Summary

This is high-quality, necessary work that addresses real network stability issues River was hitting. The session actor race fix is elegant, and the handshake simplification is architecturally sound.

Primary concern: The 49-file scope makes comprehensive review challenging. The changes appear well-tested (River integration passes), but thorough soak testing on multi-peer networks is essential before production use.

Verdict: Strong foundation once CI completes. The networking fixes appear solid based on the River test results. Main suggestion is considering stacked PRs for future refactors of this magnitude.

[AI-assisted - Claude]

@freenet freenet deleted a comment from claude bot Nov 7, 2025
self.client_transactions.get(&tx).map_or(0, |s| s.len())
);

if let Some(result_arc) = self.pending_results.get_mut(&tx).and_then(|pending| {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we leak this forever? we never remove from this map?

@sanity
Copy link
Collaborator Author

sanity commented Nov 7, 2025

This PR has been replaced by subsequent stacked PRs.

@sanity sanity closed this Nov 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants