
Option C: Snapshot Pre-seed Workaround

Problem

OpenRaft 0.9.21 has a bug in which the assertion upto >= log_id_range.prev fails in progress/inflight/mod.rs:178 during learner replication. The failure occurs when:

  1. A learner is added to the cluster with add_learner()
  2. The leader's progress-tracking state becomes inconsistent during the initial log replication

Root Cause Analysis

When a new learner joins, it has an empty log, so the leader must replicate every entry from the beginning. During this catch-up phase, OpenRaft's progress tracking can become inconsistent:

  • Replication streams are re-spawned
  • Progress reverts to zero
  • The upto >= log_id_range.prev invariant is violated, tripping the assertion

Workaround Approach: Snapshot Pre-seed

Instead of relying on OpenRaft's log replication to catch up the learner, we pre-seed the learner with a snapshot before adding it to the cluster.

How It Works

  1. Leader exports snapshot:

    // On leader node: export the current snapshot. This assumes the
    // snapshot data is an in-memory buffer (e.g. Cursor<Vec<u8>>).
    let snapshot = raft_storage
        .get_current_snapshot()
        .await?
        .expect("no snapshot available; trigger one first");
    let bytes = snapshot.snapshot.into_inner(); // Vec<u8>
    
  2. Transfer snapshot to learner:

    • Via file copy (manual)
    • Via new gRPC API endpoint (automated)
  3. Learner imports snapshot:

    // On learner node, before starting Raft (sketch: the exact
    // deserialize/install calls depend on the storage backend)
    let snapshot = Snapshot::from_bytes(&bytes)?;
    state_machine.install_snapshot(&snapshot.meta, snapshot.snapshot).await?;
    
    // Align log state with the snapshot: purge everything up to and
    // including the snapshot's last log id
    if let Some(last_log_id) = snapshot.meta.last_log_id {
        log_storage.purge(last_log_id).await?;
    }
    
  4. Add pre-seeded learner:

    • Learner already has state at last_log_index
    • Only recent entries (since snapshot) need replication
    • Minimal replication window avoids the bug
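As a rough illustration of step 4, the replication window after pre-seeding can be modeled with simple index arithmetic. This is a std-only sketch; entries_to_replicate is a hypothetical helper, not OpenRaft's actual progress accounting:

```rust
use std::ops::RangeInclusive;

// Toy model of the post-join replication window. `leader_last` is the
// leader's last log index; `learner_snapshot_last` is the last index
// covered by the snapshot the learner was pre-seeded with (0 means no
// snapshot, i.e. an empty learner).
fn entries_to_replicate(leader_last: u64, learner_snapshot_last: u64) -> RangeInclusive<u64> {
    (learner_snapshot_last + 1)..=leader_last
}

fn main() {
    // Without pre-seeding, the learner starts empty and the entire
    // log must replay through the buggy progress tracking.
    assert_eq!(entries_to_replicate(1000, 0), 1..=1000);

    // With a snapshot covering index 990, only the tail replicates,
    // keeping the catch-up window small.
    assert_eq!(entries_to_replicate(1000, 990), 991..=1000);

    println!("windows check out");
}
```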

Implementation Options

Option C1: Manual Data Directory Copy

  • Copy leader's data_dir/ to learner before starting
  • Simplest, but requires manual intervention
  • Good for initial cluster setup

Option C2: New ClusterService API

service ClusterService {
    // Existing
    rpc AddMember(AddMemberRequest) returns (AddMemberResponse);

    // New
    rpc TransferSnapshot(TransferSnapshotRequest) returns (stream TransferSnapshotResponse);
}

message TransferSnapshotRequest {
    uint64 target_node_id = 1;
    string target_addr = 2;
}

message TransferSnapshotResponse {
    bytes chunk = 1;
    bool done = 2;
    SnapshotMeta meta = 3; // Only in first chunk
}
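The streaming response above implies a simple framing scheme: meta rides on the first chunk, done flags the last. A std-only sketch of how the leader might chunk a snapshot and the learner reassemble it (field names mirror TransferSnapshotResponse; meta is simplified here to a bare last-log index rather than a full SnapshotMeta):

```rust
#[derive(Debug, PartialEq)]
struct Chunk {
    chunk: Vec<u8>,
    done: bool,
    meta: Option<u64>, // only set on the first chunk
}

// Split a snapshot into fixed-size chunks; emits at least one chunk
// even for an empty snapshot so the learner always receives meta.
fn to_chunks(snapshot: &[u8], chunk_size: usize, last_log_index: u64) -> Vec<Chunk> {
    let pieces: Vec<&[u8]> = snapshot.chunks(chunk_size).collect();
    let n = pieces.len().max(1);
    (0..n)
        .map(|i| Chunk {
            chunk: pieces.get(i).copied().unwrap_or(&[]).to_vec(),
            done: i == n - 1,
            meta: (i == 0).then_some(last_log_index),
        })
        .collect()
}

// Learner side: concatenate payloads and pull meta off the first chunk.
fn reassemble(chunks: &[Chunk]) -> (Vec<u8>, Option<u64>) {
    let bytes: Vec<u8> = chunks.iter().flat_map(|c| c.chunk.iter().copied()).collect();
    (bytes, chunks.first().and_then(|c| c.meta))
}

fn main() {
    let snapshot: Vec<u8> = (0u8..10).collect();
    let chunks = to_chunks(&snapshot, 4, 99);
    assert_eq!(chunks.len(), 3);
    assert!(chunks.last().unwrap().done);
    let (bytes, meta) = reassemble(&chunks);
    assert_eq!(bytes, snapshot);
    assert_eq!(meta, Some(99));
    println!("reassembled {} bytes, meta = {:?}", bytes.len(), meta);
}
```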

Modified join flow:

  1. ClusterService::add_member() first calls TransferSnapshot() to pre-seed
  2. Waits for learner to apply snapshot
  3. Then calls add_learner()
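The ordering above is the point of the change: add_learner() must not run until the learner confirms the snapshot is applied. A minimal sketch with the RPCs stubbed out so the ordering is testable (add_member_with_preseed and the JoinStep names are illustrative, not the real service code):

```rust
#[derive(Debug, PartialEq)]
enum JoinStep {
    TransferSnapshot,
    WaitForSnapshotApplied,
    AddLearner,
}

// Stubbed add_member flow: each push stands in for an RPC or wait in
// the real service; the log records the order they must happen in.
fn add_member_with_preseed(log: &mut Vec<JoinStep>) -> Result<(), String> {
    // 1. Stream the snapshot to the joining node.
    log.push(JoinStep::TransferSnapshot);
    // 2. Block until the learner reports the snapshot as applied;
    //    adding the learner earlier reopens the full-log replication
    //    window the workaround is trying to avoid.
    log.push(JoinStep::WaitForSnapshotApplied);
    // 3. Only now register the node via add_learner().
    log.push(JoinStep::AddLearner);
    Ok(())
}

fn main() {
    let mut log = Vec::new();
    add_member_with_preseed(&mut log).unwrap();
    assert_eq!(
        log,
        vec![
            JoinStep::TransferSnapshot,
            JoinStep::WaitForSnapshotApplied,
            JoinStep::AddLearner
        ]
    );
    println!("join steps ordered correctly");
}
```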

Option C3: Bootstrap from Snapshot

Add config option bootstrap_from = "node_id":

  • Node fetches snapshot from specified node on startup
  • Applies it before joining cluster
  • Then waits for add_learner() call
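Startup behavior under Option C3 branches on the new option. A sketch assuming a plain key = "value" config line; the real config format, field names, and fetch logic are not pinned down here:

```rust
#[derive(Debug, Default)]
struct NodeConfig {
    // Node to fetch a bootstrap snapshot from; None = start empty.
    bootstrap_from: Option<String>,
}

// Parse one `key = "value"` line; only bootstrap_from is recognized.
fn parse_config_line(cfg: &mut NodeConfig, line: &str) {
    if let Some((key, value)) = line.split_once('=') {
        if key.trim() == "bootstrap_from" {
            cfg.bootstrap_from = Some(value.trim().trim_matches('"').to_string());
        }
    }
}

// Startup plan: fetch and apply the snapshot before waiting for the
// leader's add_learner() call; skip the fetch when unconfigured.
fn startup_plan(cfg: &NodeConfig) -> Vec<String> {
    let mut plan = Vec::new();
    if let Some(peer) = &cfg.bootstrap_from {
        plan.push(format!("fetch snapshot from {peer}"));
        plan.push("apply snapshot locally".to_string());
    }
    plan.push("wait for add_learner()".to_string());
    plan
}

fn main() {
    let mut cfg = NodeConfig::default();
    parse_config_line(&mut cfg, r#"bootstrap_from = "node-1""#);
    let plan = startup_plan(&cfg);
    assert_eq!(plan[0], "fetch snapshot from node-1");
    assert_eq!(plan.len(), 3);
    println!("{plan:?}");
}
```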

Pros:

  • Automated, no manual intervention
  • Works with dynamic cluster expansion
  • Fits existing gRPC architecture

Cons:

  • More code to implement (~200-300 lines)
  • Snapshot transfer adds latency to join

Files to Modify

  1. chainfire/proto/cluster.proto - Add TransferSnapshot RPC
  2. chainfire-api/src/cluster_service.rs - Implement snapshot transfer
  3. chainfire-api/src/cluster_service.rs - Modify add_member flow
  4. chainfire-storage/src/snapshot.rs - Expose snapshot APIs

Test Plan

  1. Start single-node cluster
  2. Write some data (create entries in log)
  3. Start second node
  4. Call add_member() - should trigger snapshot transfer
  5. Verify second node receives data
  6. Verify no assertion failures

Estimated Effort

  • Implementation: 3-4 hours
  • Testing: 1-2 hours
  • Total: 4-6 hours

Status

  • Research complete
  • Awaiting 24h timer for upstream OpenRaft response
  • Implementation (if needed)