Outages And Recovery
What Happens If Kvasyr Is Down
While the process is down:
- No new chain indexing ticks run.
- No webhook deliveries are attempted.
- No backfill jobs are processed.
After restart:
- Indexing resumes from the last confirmed point for each chain.
- Kvasyr catches up to current finalized chain state, then returns to normal near-real-time delivery.
- Finality still applies, so events from the newest head blocks wait until they pass the chain's finality_depth.
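The restart behavior above can be sketched as a depth check. This is an illustration under stated assumptions, not Kvasyr's actual implementation: the function names isFinalized and catchUpRange, their parameters, and the exact comparison (an event counts as final once the head is at least finality_depth blocks past it) are hypothetical.

```javascript
// Hypothetical sketch: an event is treated as final once the chain head has
// advanced at least `finalityDepth` blocks past the block that contains it.
function isFinalized(eventBlock, headBlock, finalityDepth) {
  return headBlock - eventBlock >= finalityDepth;
}

// After a restart, blocks between the last confirmed point and the finalized
// tip are eligible for catch-up; blocks newer than the tip must still wait.
function catchUpRange(lastConfirmedBlock, headBlock, finalityDepth) {
  const finalizedTip = headBlock - finalityDepth;
  if (finalizedTip <= lastConfirmedBlock) return null; // nothing new is final yet
  return { from: lastConfirmedBlock + 1, to: finalizedTip };
}
```

With a head at block 112 and a depth of 12, a process that last confirmed block 90 would catch up on blocks 91 through 100, while events in blocks 101 and later stay pending until the head advances.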
What Catch-Up Means For Webhooks
During catch-up, webhook traffic can temporarily spike because the backlog of finalized events is delivered in quick succession.
- Delivery semantics remain at-least-once.
- Event arrival order can vary during catch-up and retries.
- Your webhook handler should be idempotent.
If A Client Webhook Is Unreachable
Success condition:
- Only an HTTP 2xx response marks a delivery as delivered.
Failure behavior (4xx, 5xx, timeout, connect errors):
- Kvasyr retries automatically with exponential backoff.
- Retries stop after the configured maximum attempts.
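The success and failure rules above can be sketched as follows. This is a rough model only: the constants BASE_DELAY_MS, MAX_DELAY_MS, and MAX_ATTEMPTS and the doubling schedule are assumptions for illustration, since Kvasyr's real values come from its configuration and are not stated here.

```javascript
// Hypothetical defaults; Kvasyr's real base delay, cap, and attempt limit
// are configuration values that this document does not specify.
const BASE_DELAY_MS = 1000;
const MAX_DELAY_MS = 60 * 60 * 1000;
const MAX_ATTEMPTS = 8;

// Only a 2xx status counts as delivered. Timeouts and connect errors have no
// status at all, so passing `undefined` also yields false.
function isDelivered(status) {
  return Number.isInteger(status) && status >= 200 && status < 300;
}

// Returns the delay before the next attempt, or null when retries are
// exhausted. `attemptsSoFar` counts attempts already made (1 after the
// first failure).
function nextRetryDelayMs(attemptsSoFar) {
  if (attemptsSoFar >= MAX_ATTEMPTS) return null;
  const delay = BASE_DELAY_MS * 2 ** (attemptsSoFar - 1);
  return Math.min(delay, MAX_DELAY_MS);
}
```

Under these assumed values the waits double from 1 second up to 64 seconds, and the eighth failed attempt ends automatic retries.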
If the endpoint becomes reachable later:
- If max attempts has not been reached, a later retry can still succeed.
- If max attempts has been reached, an admin can manually retry and, when needed, reset attempt counters for a clean re-drive.
Tracking Catch-Up Bursts
Use event-id deduplication and track unique-vs-duplicate traffic during recovery windows.
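// Pseudocode for an idempotent webhook handler
const eventId = req.header("x-kvasyr-event-id") ?? req.body.id;
if (!eventId) return res.status(400).send("missing event id");
if (await seenBefore(eventId)) {
  // Acknowledge duplicates with 2xx so they are not retried again.
  return res.status(200).send("duplicate ignored");
}
// Keep dedup keys for 7 days. Note: if processing fails after this point,
// a retry is treated as a duplicate; clear the marker on failure if the
// event must be reprocessed.
await markSeen(eventId, { ttlSeconds: 7 * 24 * 60 * 60 });
await processBusinessLogic(req.body);
return res.status(200).send("ok");

Operationally, it helps to chart: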
- Incoming webhook rate per minute.
- Unique event IDs per minute.
- Duplicate ratio (duplicates / total).
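The three metrics above could be collected with a small per-minute counter. The sketch below is a hypothetical in-memory version: the class name DeliveryStats and its methods are illustrative, and a production setup would more likely use a shared store and an existing metrics library.

```javascript
// Hypothetical in-memory tracker: counts total and unique deliveries per
// one-minute bucket so the duplicate ratio can be derived from them.
class DeliveryStats {
  constructor() {
    this.seen = new Set();    // event IDs observed so far
    this.buckets = new Map(); // minute start (ms) -> { total, unique }
  }

  // Record one incoming webhook; `nowMs` is injectable for testing.
  record(eventId, nowMs = Date.now()) {
    const minute = Math.floor(nowMs / 60000) * 60000;
    const bucket = this.buckets.get(minute) ?? { total: 0, unique: 0 };
    bucket.total += 1;
    if (!this.seen.has(eventId)) {
      this.seen.add(eventId);
      bucket.unique += 1;
    }
    this.buckets.set(minute, bucket);
  }

  // duplicates / total for the bucket starting at `minuteMs`.
  duplicateRatio(minuteMs) {
    const bucket = this.buckets.get(minuteMs);
    if (!bucket || bucket.total === 0) return 0;
    return (bucket.total - bucket.unique) / bucket.total;
  }
}
```

Calling record(eventId) at the top of the webhook handler, before the deduplication check, keeps duplicates in the totals. The set of seen IDs grows without bound in this sketch, so a real deployment would expire old entries.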
Client Footguns Checklist
- Deduplicate by event identity (payload.id or X-Kvasyr-Event-Id).
- Treat delivery as at-least-once, not exactly-once.
- Do not assume strict ordering.
- Accept delayed deliveries after outages and process by event content, not arrival time.
- Verify signatures and timestamp freshness on every request.
- Monitor delivery failures and retry backlog so you can respond before events age out.
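One way to follow the ordering items in the checklist is to compare each event's on-chain position with what is already stored, rather than trusting arrival order. The sketch below is only an illustration: the payload fields entityId, blockNumber, and logIndex are assumptions, not confirmed Kvasyr payload fields, and the in-memory Map stands in for a real database.

```javascript
// Hypothetical payload fields: `entityId`, `blockNumber`, and `logIndex` are
// assumptions for illustration; use the position fields your events carry.
const latestByEntity = new Map();

// Returns true when event `a` is strictly later on chain than event `b`.
function isNewer(a, b) {
  if (a.blockNumber !== b.blockNumber) return a.blockNumber > b.blockNumber;
  return a.logIndex > b.logIndex;
}

// Applies an event only if it is newer than the stored state, so a late or
// repeated delivery never overwrites fresher data.
function applyEvent(event) {
  const current = latestByEntity.get(event.entityId);
  if (current && !isNewer(event, current)) return false; // stale or duplicate
  latestByEntity.set(event.entityId, event);
  return true;
}
```

This "latest position wins" approach suits handlers that maintain current state, such as a balance or status. Handlers that must act on every event individually should rely on event-id deduplication instead.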