# T040.S4 Service Reconnection Test Scenarios

## Overview

Test scenarios for validating service reconnection behavior after transient failures.

Test Environment: Option B2 (Local Multi-Instance)

Approved: 2025-12-11

Setup: 3 instances per service running on localhost with different ports

  - ChainFire: ports 2379, 2380, 2381 (or similar)
  - FlareDB: ports 5000, 5001, 5002 (or similar)

Failure Simulation Methods (adapted from VM approach):

  - Process kill: `kill -9 <pid>` simulates sudden node failure
  - SIGTERM: `kill <pid>` simulates graceful shutdown
  - Port blocking: `iptables -A INPUT -p tcp --dport <port> -j DROP` (requires root)
  - Pause: `kill -STOP <pid>` / `kill -CONT <pid>` simulates a freeze

Note: Cross-VM network partition tests are deferred to T039 (production deployment).
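The pause/resume method above can be verified end to end before pointing it at a real service. This is a minimal sketch using a background `sleep` as a stand-in process; substitute the PID of an actual service instance.

```shell
# Demonstrate the pause/resume simulation on a stand-in process
# (a background `sleep`; substitute a real service PID).
sleep 60 &
pid=$!

kill -STOP "$pid"                       # simulate a freeze
sleep 0.2
state=$(ps -o stat= -p "$pid" | tr -d ' ')
echo "paused state: $state"             # T = stopped

kill -CONT "$pid"                       # resume
sleep 0.2
state=$(ps -o stat= -p "$pid" | tr -d ' ')
echo "resumed state: $state"            # S = sleeping (running again)

kill "$pid"                             # clean up
```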

## Current State Analysis

### Services WITH Reconnection Logic

| Service   | Mechanism                                                   | Location                                            |
|-----------|-------------------------------------------------------------|-----------------------------------------------------|
| ChainFire | Exponential backoff (3 retries, 2.0x multiplier, 500ms-30s) | `chainfire/crates/chainfire-api/src/raft_client.rs` |
| FlareDB   | PD client auto-reconnect (10s cycle), connection pooling    | `flaredb/crates/flaredb-server/src/main.rs:283-356` |

### Services WITHOUT Reconnection Logic (GAPS)

| Service       | Gap                               | Risk                                          |
|---------------|-----------------------------------|-----------------------------------------------|
| PlasmaVMC     | No retry/reconnection             | VM operations fail silently on a network blip |
| IAM           | No retry mechanism                | Auth failures cascade to all services         |
| Watch streams | Break on error, no auto-reconnect | Config/event propagation stops                |

## Test Scenarios

### Scenario 1: ChainFire Raft Recovery

Goal: Verify Raft RPC retry logic works under network failures

Steps:

  1. Start 3-node ChainFire cluster
  2. Write key-value pair
  3. Use iptables to block traffic to leader node
  4. Attempt read/write operation from client
  5. Observe retry behavior (should retry with backoff)
  6. Unblock traffic
  7. Verify operation completes or fails gracefully

Expected:

  - Client retries up to 3 times with exponential backoff
  - Clear error message on final failure
  - No data corruption

Evidence: Client logs showing retry attempts, timing
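The retry behavior expected in step 5 can be sketched as a backoff loop using the ChainFire parameters listed earlier (3 retries, 500ms initial delay, 2.0x multiplier). `client_op` is a stand-in that always fails, simulating a blocked leader; substitute a real read/write against the cluster.

```shell
# Backoff sketch: 3 attempts, 500ms initial delay, 2.0x multiplier.
client_op() { false; }    # stand-in: replace with a real client call

delay_ms=500
ok=0
for attempt in 1 2 3; do
  if client_op; then
    ok=1
    echo "attempt $attempt: succeeded"
    break
  fi
  echo "attempt $attempt: failed, backing off ${delay_ms}ms"
  sleep "$(awk "BEGIN { print $delay_ms / 1000 }")"
  delay_ms=$(( delay_ms * 2 ))
done
[ "$ok" -eq 1 ] || echo "all attempts failed: surface a clear error to the caller"
```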


### Scenario 2: FlareDB PD Reconnection

Goal: Verify FlareDB server reconnects to ChainFire (PD) after restart

Steps:

  1. Start ChainFire cluster (PD)
  2. Start FlareDB server connected to PD
  3. Verify heartbeat working (check logs)
  4. Kill ChainFire leader
  5. Wait for new leader election
  6. Observe FlareDB reconnection behavior

Expected:

  - FlareDB logs "Reconnected to PD" within 10-20s
  - Client operations resume after reconnection
  - No data loss during transition

Evidence: Server logs, client operation success
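Step 6 can be automated by polling the server log for the reconnect message with a deadline. The "Reconnected to PD" text and log location are assumptions about FlareDB's output; adjust to match the real log. The demo below writes the line into a temp file from a background job to exercise the helper.

```shell
# Wait until a log line appears, with a timeout.
wait_for_log() {   # usage: wait_for_log <file> <pattern> <timeout_s>
  deadline=$(( $(date +%s) + $3 ))
  while [ "$(date +%s)" -le "$deadline" ]; do
    if grep -q "$2" "$1" 2>/dev/null; then
      echo "found: $2"
      return 0
    fi
    sleep 1
  done
  echo "timeout waiting for: $2"
  return 1
}

# Demo against a stand-in log file (replace with the real server log):
log=$(mktemp)
( sleep 2; echo "INFO Reconnected to PD" >> "$log" ) &
wait_for_log "$log" "Reconnected to PD" 10
```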


### Scenario 3: Network Partition (iptables)

Goal: Verify cluster behavior during network partition

Steps:

  1. Start 3-node cluster (ChainFire + FlareDB)
  2. Write data to cluster
  3. Create a network partition: `iptables -A INPUT -s <node2-ip> -j DROP`
  4. Attempt writes (should succeed with 2/3 quorum)
  5. Kill another node (should lose quorum)
  6. Verify writes fail gracefully
  7. Restore partition, verify cluster recovery

Expected:

  - 2/3 nodes: writes succeed
  - 1/3 nodes: writes fail, no data corruption
  - Recovery: cluster resumes normal operation

Evidence: Write success/failure, data consistency check
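The quorum arithmetic behind the expectations above: Raft writes need a strict majority of the 3 voting members, which is why losing one node is survivable and losing two is not.

```shell
# Majority quorum for a 3-node cluster.
total=3
quorum=$(( total / 2 + 1 ))        # = 2 for 3 voters
for up in 3 2 1; do
  if [ "$up" -ge "$quorum" ]; then
    echo "$up/$total nodes up: writes succeed"
  else
    echo "$up/$total nodes up: writes fail"
  fi
done
```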


### Scenario 4: Service Restart Recovery

Goal: Verify clients reconnect after service restart

Steps:

  1. Start service (FlareDB/ChainFire)
  2. Connect client
  3. Perform operations
  4. Restart service (systemctl restart or SIGTERM + start)
  5. Attempt client operations

Expected:

  - ChainFire: client reconnects via retry logic
  - FlareDB: connection pool creates a new connection
  - IAM: manual reconnect required (gap)

Evidence: Client operation success after restart
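Steps 4-5 can be sketched with a stand-in service process (a background `sleep` here; substitute the real FlareDB/ChainFire binary, and a real client call for the liveness probe).

```shell
# Restart the stand-in service, then let the "client" retry until it
# can reach the new process.
sleep 300 & svc=$!
kill "$svc"                          # step 4: SIGTERM the service
wait "$svc" 2>/dev/null
sleep 300 & svc=$!                   # step 4: start it again

# step 5: client-side reconnect loop
reconnected=0
for attempt in 1 2 3; do
  if kill -0 "$svc" 2>/dev/null; then    # stand-in liveness probe
    reconnected=1
    echo "client reconnected on attempt $attempt"
    break
  fi
  sleep 1
done
kill "$svc"                          # clean up
```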


### Scenario 5: Watch Stream Recovery (GAP DOCUMENTATION)

Goal: Document watch stream behavior on connection loss

Steps:

  1. Start ChainFire server
  2. Connect watch client
  3. Verify events received
  4. Kill server
  5. Observe client behavior

Expected: Watch breaks; no auto-reconnect.

GAP: Needs an application-level reconnect loop.

Evidence: Client logs showing stream termination
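The gap above can be closed with an application-level reconnect loop. `watch_client` is a hypothetical stand-in for the real watch command; here it "breaks" twice before staying up, to show the loop recovering from stream termination.

```shell
# Application-level reconnect wrapper around a watch stream.
runs=0
watch_client() {
  runs=$(( runs + 1 ))
  [ "$runs" -ge 3 ]        # stand-in: stream breaks on the first two runs
}

until watch_client; do
  echo "watch stream broke (run $runs), reconnecting in 1s"
  sleep 1
done
echo "watch stream established (run $runs)"
```

A production wrapper would also resume the watch from the last seen revision so no events are lost across reconnects.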


## Test Matrix

| Scenario              | ChainFire | FlareDB | PlasmaVMC | IAM     |
|-----------------------|-----------|---------|-----------|---------|
| S1: Raft Recovery     | TEST      | n/a     | n/a       | n/a     |
| S2: PD Reconnect      | n/a       | TEST    | n/a       | n/a     |
| S3: Network Partition | TEST      | TEST    | SKIP      | SKIP    |
| S4: Restart Recovery  | TEST      | TEST    | DOC-GAP   | DOC-GAP |
| S5: Watch Recovery    | DOC-GAP   | DOC-GAP | n/a       | n/a     |

## Prerequisites (Option B2 - Local Multi-Instance)

  - 3 ChainFire instances running on localhost (provided by S1)
  - 3 FlareDB instances running on localhost (provided by S1)
  - Separate data directories per instance
  - Logging enabled at DEBUG level for evidence
  - Process management tools (`kill`, `pkill`)
  - Optional: `iptables` for port-blocking tests (requires root)
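The per-instance layout can be sketched as below. The binary name and flags are assumptions, not confirmed CLI options; only the directory-per-instance layout and the port scheme come from this document.

```shell
# Create a data directory per instance and show the intended launch
# commands (hypothetical binary name and flags).
for i in 0 1 2; do
  dir="./data/chainfire-$i"
  port=$(( 2379 + i ))
  mkdir -p "$dir"
  echo "would start: chainfire-server --data-dir $dir --port $port"
done
```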

## Success Criteria

  - All TEST scenarios pass
  - GAP scenarios documented with recommendations
  - No data loss in any failure scenario
  - Clear error messages on unrecoverable failures

## Future Work (Identified Gaps)

  1. PlasmaVMC: Add retry logic for remote service calls
  2. IAM Client: Add exponential backoff retry
  3. Watch streams: Add auto-reconnect wrapper