Outages And Recovery

What Happens If Kvasyr Is Down

While the process is down:

  • No new chain indexing ticks run.
  • No webhook deliveries are attempted.
  • No backfill jobs are processed.

After restart:

  • Indexing resumes from the last confirmed point for each chain.
  • Kvasyr catches up to current finalized chain state, then returns to normal near-real-time delivery.
  • Finality still applies: events from the newest blocks near the chain head are delivered only after those blocks pass the chain's finality_depth.
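The restart behavior above can be modeled as choosing a block range per chain. Start just after the last confirmed block and stop at the newest block that satisfies finality_depth. This is a hypothetical sketch, not Kvasyr's internals. The names ChainCursor and catchUpRange are illustrative, and it assumes a block counts as final once it is at least finality_depth blocks behind the head.

```typescript
// Hypothetical model of how a restart picks its catch-up range.
// Assumes a block is final once (head - height) >= finalityDepth;
// Kvasyr's exact comparison may differ.

interface ChainCursor {
  chain: string;
  lastConfirmedBlock: number; // last block already indexed before the outage
  finalityDepth: number;      // per-chain finality_depth setting
}

interface CatchUpRange {
  from: number; // first block to index (inclusive)
  to: number;   // newest finalized block to index (inclusive)
}

// Returns the range to index now, or null if no new block is final yet.
function catchUpRange(cursor: ChainCursor, headBlock: number): CatchUpRange | null {
  const lastFinal = headBlock - cursor.finalityDepth;
  const from = cursor.lastConfirmedBlock + 1;
  return from <= lastFinal ? { from, to: lastFinal } : null;
}

// Example: the outage left the cursor at block 1000; the head is now 1250, depth 12.
const range = catchUpRange({ chain: "example", lastConfirmedBlock: 1_000, finalityDepth: 12 }, 1_250);
console.log(range); // { from: 1001, to: 1238 }
```

Blocks 1239 to 1250 are still too close to the head under this assumption, which is why the newest events keep waiting even after catch-up completes.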

What Catch-Up Means For Webhooks

During catch-up, webhook traffic can spike temporarily because events that became final while Kvasyr was down are delivered in quick succession.

  • Delivery semantics remain at-least-once.
  • Event arrival order can vary during catch-up and retries.
  • Your webhook handler should be idempotent.
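Because arrival order can vary, handlers that maintain latest-state views can compare chain positions instead of trusting arrival order. A minimal sketch, assuming each payload carries a block number and log index; the field names entityKey, blockNumber and logIndex are hypothetical, as is the in-memory storage.

```typescript
// Sketch: tolerate out-of-order arrival by comparing each event's chain
// position with the last position applied for the same entity.
// entityKey, blockNumber and logIndex are hypothetical payload fields.

interface Position {
  blockNumber: number;
  logIndex: number;
}

interface ChainEvent extends Position {
  id: string;
  entityKey: string; // e.g. an address or token ID your app tracks
  value: string;
}

// In-memory for illustration; use durable storage in a real handler.
const lastApplied = new Map<string, Position>();
const state = new Map<string, string>();

// True if position a comes strictly after position b on the chain.
function isNewer(a: Position, b: Position): boolean {
  return a.blockNumber > b.blockNumber ||
    (a.blockNumber === b.blockNumber && a.logIndex > b.logIndex);
}

// Applies the event only if it is newer than what was already applied.
function applyLatest(ev: ChainEvent): boolean {
  const prev = lastApplied.get(ev.entityKey);
  if (prev && !isNewer(ev, prev)) return false; // stale or repeated: skip
  lastApplied.set(ev.entityKey, { blockNumber: ev.blockNumber, logIndex: ev.logIndex });
  state.set(ev.entityKey, ev.value);
  return true;
}
```

This pattern suits last-write-wins state such as current balances or ownership. For additive effects (counting transfers, crediting deposits), rely on event-ID deduplication instead, since every event must be applied once regardless of the order it arrives in.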

If A Client Webhook Is Unreachable

Success condition:

  • Only an HTTP 2xx response marks a delivery as delivered.

Failure behavior (4xx or 5xx responses, timeouts, connection errors):

  • Kvasyr retries automatically with exponential backoff.
  • Retries stop after the configured maximum number of attempts.
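Exponential backoff means each retry waits longer than the previous one, typically doubling up to a ceiling. The sketch uses placeholder values (1 s base delay, 60 s cap, 8 total attempts including the first) that are assumptions for illustration, not Kvasyr's configured defaults; RetryPolicy, backoffDelayMs and retrySchedule are hypothetical names.

```typescript
// Illustrative exponential backoff schedule. The policy values used below
// are placeholders, not Kvasyr defaults.

interface RetryPolicy {
  baseDelayMs: number;
  maxDelayMs: number;
  maxAttempts: number; // total attempts, including the first delivery
}

// Delay before retry number `retry` (1 = first retry): doubles each time, capped.
function backoffDelayMs(retry: number, policy: RetryPolicy): number {
  return Math.min(policy.baseDelayMs * 2 ** (retry - 1), policy.maxDelayMs);
}

// maxAttempts total attempts means maxAttempts - 1 waits between them.
function retrySchedule(policy: RetryPolicy): number[] {
  const delays: number[] = [];
  for (let retry = 1; retry < policy.maxAttempts; retry++) {
    delays.push(backoffDelayMs(retry, policy));
  }
  return delays;
}

const policy: RetryPolicy = { baseDelayMs: 1_000, maxDelayMs: 60_000, maxAttempts: 8 };
console.log(retrySchedule(policy).join(", ")); // 1000, 2000, 4000, 8000, 16000, 32000, 60000
```

With these placeholder values, a delivery is abandoned after 123 s of cumulative waiting. Retry schedules in practice often add random jitter so that many failing deliveries do not retry in lockstep; whether Kvasyr does so is not covered here.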

If the endpoint becomes reachable later:

  • If the maximum number of attempts has not been reached, a later automatic retry can still succeed.
  • If the maximum has been reached, an admin can retry the delivery manually and, when needed, reset its attempt counter for a clean re-drive.
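The success and failure rules in this section can be summarized as a small state model for a single delivery. This is an illustrative sketch: the Delivery shape, the status names and the recordAttempt and manualRetry helpers are hypothetical, and timeouts and connection errors are represented by a missing status code.

```typescript
// Hypothetical lifecycle of one delivery: only 2xx succeeds; any other
// outcome is a failed attempt until maxAttempts is reached.

type DeliveryStatus = "pending" | "delivered" | "exhausted";

interface Delivery {
  eventId: string;
  attempts: number;
  status: DeliveryStatus;
}

// statusCode is undefined for timeouts and connection errors.
function recordAttempt(d: Delivery, statusCode: number | undefined, maxAttempts: number): Delivery {
  const attempts = d.attempts + 1;
  if (statusCode !== undefined && statusCode >= 200 && statusCode < 300) {
    return { ...d, attempts, status: "delivered" };
  }
  return { ...d, attempts, status: attempts >= maxAttempts ? "exhausted" : "pending" };
}

// Admin re-drive of an exhausted delivery, optionally resetting the counter.
function manualRetry(d: Delivery, resetAttempts: boolean): Delivery {
  return { ...d, status: "pending", attempts: resetAttempts ? 0 : d.attempts };
}
```

Under this model, a manual retry without a reset leaves the counter at the maximum, so one more failure exhausts the delivery again immediately; resetting restores the full retry budget.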

Tracking Catch-Up Bursts

Deduplicate by event ID, and track unique versus duplicate traffic during recovery windows.

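// Idempotent webhook handler (Express-style sketch; the route is an example).
// seenBefore, markSeen and processBusinessLogic stand in for your own storage and logic.
app.post("/webhooks/kvasyr", async (req, res) => {
  // Prefer the event ID header, falling back to the payload ID.
  const eventId = req.header("x-kvasyr-event-id") ?? req.body?.id;
  if (!eventId) return res.status(400).send("missing event id");

  // Already processed: still return 2xx so Kvasyr stops retrying.
  if (await seenBefore(eventId)) {
    return res.status(200).send("duplicate ignored");
  }

  // Process first, then record the ID: if processing fails, no 2xx is sent and the
  // ID stays unmarked, so a retry can complete the work. Concurrent duplicates
  // need an atomic claim (e.g. insert-if-absent) rather than check-then-mark.
  await processBusinessLogic(req.body);
  await markSeen(eventId, { ttlSeconds: 7 * 24 * 60 * 60 }); // remember IDs for 7 days
  return res.status(200).send("ok");
});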

Operationally, it helps to chart:

  • Incoming webhook rate per minute.
  • Unique event IDs per minute.
  • Duplicate ratio (duplicates / total).
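The three metrics above can be derived from a log of received webhooks. A minimal sketch, assuming you record each request's event ID and receive time; the Received and MinuteStats shapes and perMinuteStats are hypothetical. An event counts as a duplicate if its ID appeared anywhere earlier in the log, not only in the same minute.

```typescript
// Sketch: per-minute webhook metrics from your own request log.

interface Received {
  eventId: string;
  receivedAt: Date;
}

interface MinuteStats {
  minute: string;         // UTC time truncated to the minute, "YYYY-MM-DDTHH:MM"
  total: number;          // incoming webhook requests
  unique: number;         // first-time event IDs
  duplicateRatio: number; // duplicates / total
}

function perMinuteStats(log: Received[]): MinuteStats[] {
  const seen = new Set<string>();
  const buckets = new Map<string, { total: number; unique: number }>();
  // Sort by time so "seen earlier" and bucket order are chronological.
  const sorted = [...log].sort((a, b) => a.receivedAt.getTime() - b.receivedAt.getTime());
  for (const r of sorted) {
    const minute = r.receivedAt.toISOString().slice(0, 16);
    const b = buckets.get(minute) ?? { total: 0, unique: 0 };
    b.total += 1;
    if (!seen.has(r.eventId)) {
      seen.add(r.eventId);
      b.unique += 1;
    }
    buckets.set(minute, b);
  }
  return [...buckets].map(([minute, b]) => ({
    minute,
    total: b.total,
    unique: b.unique,
    duplicateRatio: (b.total - b.unique) / b.total,
  }));
}
```

A duplicate ratio that rises during catch-up and then falls back is consistent with retries of deliveries that timed out or failed; a persistently high ratio may mean your handler is responding too slowly or with non-2xx codes.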

Client Footguns Checklist

  • Deduplicate by event identity (payload.id or X-Kvasyr-Event-Id).
  • Treat delivery as at-least-once, not exactly-once.
  • Do not assume strict ordering.
  • Accept delayed deliveries after outages and process by event content, not arrival time.
  • Verify signatures and timestamp freshness on every request.
  • Monitor delivery failures and retry backlog so you can respond before events age out.
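This page does not specify Kvasyr's signing scheme, so the following is a sketch under an assumed one: an HMAC-SHA256 hex signature over `${timestamp}.${rawBody}` with a shared secret, where the timestamp is the Unix time (seconds) at which that delivery attempt was signed. The function name verifyWebhook, the 5-minute tolerance and the message format are all hypothetical; replace them with what Kvasyr actually documents. It uses only Node's built-in crypto module and should run on the raw request body, before JSON parsing.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// ASSUMED scheme: signature = hex(HMAC-SHA256(secret, `${timestamp}.${rawBody}`)),
// with timestamp = Unix seconds when this delivery attempt was signed.
const TOLERANCE_SECONDS = 5 * 60; // hypothetical freshness window

function verifyWebhook(
  rawBody: string,
  timestamp: string,
  signatureHex: string,
  secret: string,
  nowSeconds: number = Math.floor(Date.now() / 1000),
): boolean {
  const ts = Number(timestamp);
  // Reject garbled timestamps and stale or far-future requests (replay protection).
  if (!Number.isInteger(ts) || Math.abs(nowSeconds - ts) > TOLERANCE_SECONDS) return false;

  const expected = createHmac("sha256", secret).update(`${timestamp}.${rawBody}`).digest();
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

If the timestamp is per attempt, as assumed here, delayed deliveries after an outage still pass the freshness check. A freshness window applied to the original event time would instead reject legitimate catch-up traffic, which conflicts with accepting delayed deliveries.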