photoncloud-monorepo/docs/por/T041-chainfire-cluster-join-fix/option-c-snapshot-preseed.md
# Option C: Snapshot Pre-seed Workaround
## Problem
OpenRaft 0.9.21 has a bug where the assertion `upto >= log_id_range.prev` fails in `progress/inflight/mod.rs:178` during learner replication. This occurs when:
1. A learner is added to a cluster with `add_learner()`
2. The leader's progress tracking state becomes inconsistent during initial log replication
## Root Cause Analysis
When a new learner joins, it has empty log state. The leader must replicate all logs from the beginning. During this catch-up phase, OpenRaft's progress tracking can become inconsistent when:
- Replication streams are re-spawned
- Progress reverts to zero
- The `upto >= log_id_range.prev` invariant is violated
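To make the violated invariant concrete, here is a toy model of the check, using bare `u64` indices rather than OpenRaft's actual types (the struct and method names here are illustrative, not OpenRaft APIs):

```rust
/// Toy stand-in for an in-flight log range on the leader's progress tracker.
struct InflightLogs {
    /// Index just before the first in-flight entry (`log_id_range.prev`).
    prev: u64,
}

impl InflightLogs {
    /// Mirrors the assertion `upto >= log_id_range.prev`: an ack must never
    /// point below the start of the range currently in flight.
    fn ack(&mut self, upto: u64) {
        assert!(upto >= self.prev, "upto({upto}) < prev({})", self.prev);
        self.prev = upto;
    }
}

fn main() {
    // Normal catch-up: acks move forward monotonically.
    let mut p = InflightLogs { prev: 100 };
    p.ack(150);

    // A re-spawned replication stream whose progress reverted to zero
    // acks below `prev` -- the shape of the panic seen in 0.9.21.
    let boom = std::panic::catch_unwind(move || p.ack(0));
    assert!(boom.is_err());
    println!("ok");
}
```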
## Workaround Approach: Snapshot Pre-seed
Instead of relying on OpenRaft's log replication to catch up the learner, we pre-seed the learner with a snapshot before adding it to the cluster.
### How It Works
1. **Leader exports snapshot:**
```rust
// On leader node. In OpenRaft 0.9 get_current_snapshot() yields
// Option<Snapshot<_>>, so handle the "no snapshot yet" case.
let snapshot = raft_storage.get_current_snapshot().await?
    .expect("leader has no snapshot; trigger one first");
let bytes = snapshot.snapshot.into_inner(); // typically Cursor<Vec<u8>> -> Vec<u8>
```
2. **Transfer snapshot to learner:**
- Via file copy (manual)
- Via new gRPC API endpoint (automated)
3. **Learner imports snapshot:**
```rust
// On learner node, before starting Raft. `meta` is the SnapshotMeta
// received alongside the snapshot bytes (per OpenRaft 0.9's storage traits).
state_machine.install_snapshot(&meta, Box::new(Cursor::new(bytes))).await?;
// Align the log with the snapshot: purge everything it already covers.
if let Some(last_log_id) = meta.last_log_id {
    log_storage.purge(last_log_id).await?;
}
```
4. **Add pre-seeded learner:**
- Learner already has state at `last_log_index`
- Only recent entries (since snapshot) need replication
- Minimal replication window avoids the bug
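To see why pre-seeding shrinks the risk window, here is a tiny self-contained model (not OpenRaft code; the function name is made up) of how many entries the leader still has to stream after `add_learner()`:

```rust
use std::ops::RangeInclusive;

/// Log entries the leader still has to stream to a learner that was
/// pre-seeded up to `snapshot_last`, when the leader's log ends at
/// `leader_last`. Purely illustrative.
fn replication_window(snapshot_last: u64, leader_last: u64) -> RangeInclusive<u64> {
    snapshot_last + 1..=leader_last
}

fn main() {
    // A fresh learner starts from index 0: the leader must replay the
    // whole log, the long catch-up phase in which the assertion trips.
    assert_eq!(replication_window(0, 10_000).count(), 10_000);

    // Pre-seeded at the snapshot's last index, only entries written
    // since the snapshot remain to be replicated.
    assert_eq!(replication_window(9_990, 10_000).count(), 10);
    println!("ok");
}
```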
### Implementation Options
#### Option C1: Manual Data Directory Copy
- Copy leader's `data_dir/` to learner before starting
- Simplest, but requires manual intervention
- Good for initial cluster setup
#### Option C2: New ClusterService API
```protobuf
service ClusterService {
  // Existing
  rpc AddMember(AddMemberRequest) returns (AddMemberResponse);
  // New
  rpc TransferSnapshot(TransferSnapshotRequest) returns (stream TransferSnapshotResponse);
}

message TransferSnapshotRequest {
  uint64 target_node_id = 1;
  string target_addr = 2;
}

message TransferSnapshotResponse {
  bytes chunk = 1;
  bool done = 2;
  SnapshotMeta meta = 3; // Only in first chunk
}
```
Modified join flow:
1. `ClusterService::add_member()` first calls `TransferSnapshot()` to pre-seed
2. Waits for learner to apply snapshot
3. Then calls `add_learner()`
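The streaming contract can be sketched end to end with a self-contained round trip. `Frame` is a hand-rolled stand-in for the generated `TransferSnapshotResponse`, and the chunk size is an assumption, not a measured value:

```rust
const CHUNK_SIZE: usize = 1 << 20; // 1 MiB, safely under gRPC's 4 MiB default cap

/// Stand-in for the generated TransferSnapshotResponse message.
#[derive(Debug)]
struct Frame {
    chunk: Vec<u8>,
    done: bool,
    meta: Option<Vec<u8>>, // serialized SnapshotMeta; first frame only
}

/// Leader side: split snapshot bytes into frames per the proto contract.
fn frame_snapshot(bytes: &[u8], meta: Vec<u8>) -> Vec<Frame> {
    let chunks: Vec<&[u8]> = if bytes.is_empty() {
        vec![&bytes[..0]] // still emit one empty frame so meta and `done` arrive
    } else {
        bytes.chunks(CHUNK_SIZE).collect()
    };
    let last = chunks.len() - 1;
    chunks
        .iter()
        .enumerate()
        .map(|(i, c)| Frame {
            chunk: c.to_vec(),
            done: i == last,
            meta: (i == 0).then(|| meta.clone()),
        })
        .collect()
}

/// Learner side: reassemble the stream back into (meta, snapshot bytes).
fn assemble(frames: Vec<Frame>) -> Result<(Vec<u8>, Vec<u8>), String> {
    let mut bytes = Vec::new();
    let mut meta = None;
    for f in frames {
        if meta.is_none() {
            meta = f.meta; // only the first frame carries meta
        }
        bytes.extend_from_slice(&f.chunk);
        if f.done {
            return meta.map(|m| (m, bytes)).ok_or_else(|| "no meta in stream".into());
        }
    }
    Err("stream ended without a `done` frame".into())
}

fn main() {
    let snapshot = vec![7u8; 2_500_000]; // ~2.4 MiB -> 3 frames
    let frames = frame_snapshot(&snapshot, b"meta".to_vec());
    assert_eq!(frames.len(), 3);
    assert!(frames[0].meta.is_some() && frames[1].meta.is_none());
    assert!(frames[2].done && !frames[0].done);

    let (meta, bytes) = assemble(frames).unwrap();
    assert_eq!(meta, b"meta");
    assert_eq!(bytes, snapshot);
    println!("round-trip ok");
}
```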
#### Option C3: Bootstrap from Snapshot
Add config option `bootstrap_from = "node_id"`:
- Node fetches snapshot from specified node on startup
- Applies it before joining cluster
- Then waits for `add_learner()` call
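One possible shape for that knob (section and key names are illustrative; chainfire's real config schema may differ):

```toml
# Hypothetical learner-side config for Option C3
[cluster]
node_id = 2
listen_addr = "0.0.0.0:9402"  # illustrative address
bootstrap_from = "1"          # fetch + apply node 1's snapshot before joining
```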
### Recommended Approach: C2 (API-based)
**Pros:**
- Automated, no manual intervention
- Works with dynamic cluster expansion
- Fits existing gRPC architecture
**Cons:**
- More code to implement (~200-300 lines)
- Snapshot transfer adds latency to join
### Files to Modify
1. `chainfire/proto/cluster.proto` - Add TransferSnapshot RPC
2. `chainfire-api/src/cluster_service.rs` - Implement snapshot transfer
3. `chainfire-api/src/cluster_service.rs` - Modify add_member flow
4. `chainfire-storage/src/snapshot.rs` - Expose snapshot APIs
### Test Plan
1. Start single-node cluster
2. Write some data (create entries in log)
3. Start second node
4. Call add_member() - should trigger snapshot transfer
5. Verify second node receives data
6. Verify no assertion failures
### Estimated Effort
- Implementation: 3-4 hours
- Testing: 1-2 hours
- Total: 4-6 hours
### Status
- [x] Research complete
- [ ] Awaiting 24h timer for upstream OpenRaft response
- [ ] Implementation (if needed)