Operator runbook

Knocker is intentionally embedded. There is no built-in admin UI or operator HTTP API, so production recovery starts from the binding surface or admin route you own.

This page is the “3 AM” checklist.

Start by listing what recently arrived for the endpoint you care about: recent events, invalid deliveries, and orphan deliveries. Use the Operator surface page for binding-specific calls.

Use Event rows to understand processing state. Use Delivery rows to understand what HTTP receipts arrived.

Provider says it sent the webhook, but my handler did not run

Check for an orphan delivery first.

If a delivery is orphaned with signature_valid=False, Knocker stored the receipt but did not create or enqueue an event because verification failed.

If there is a linked delivery, inspect the event with get_event(...).
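
The checks above can be written down as a small decision function. This is a minimal sketch, not Knocker's API: `ToyDelivery`, `event_id`, and `diagnose` are hypothetical names used only for illustration, and only `signature_valid` comes from the text. The "no receipt" and "orphan with a valid signature" branches are assumptions about cases this section does not describe.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyDelivery:
    # Hypothetical stand-in for a stored Delivery row. Only the field name
    # signature_valid is taken from the runbook text.
    signature_valid: bool
    event_id: Optional[int]  # None models an orphan delivery (assumed shape)


def diagnose(delivery: Optional[ToyDelivery]) -> str:
    """Return the next triage step for one provider request."""
    if delivery is None:
        # Assumption: no stored receipt means the request never reached ingress.
        return "no receipt stored: check provider logs and your ingress route"
    if delivery.event_id is None:
        if not delivery.signature_valid:
            return "orphan, verification failed: no event was created or enqueued"
        # Not described in the runbook; flagged rather than guessed.
        return "orphan with a valid signature: investigate further"
    return f"linked: inspect event {delivery.event_id} and read its status"


print(diagnose(ToyDelivery(signature_valid=False, event_id=None)))
```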

Common statuses:

  • received: stored and waiting for a worker.
  • processing: currently claimed by a worker or still inside a visibility window.
  • failed: handler failed but may be retried or requeued.
  • dead: retries exhausted; explicit operator recovery is needed.
  • ignored: operator or application policy intentionally skipped it.
  • handled: handler finished and the queue ack committed.
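
For an admin route you own, it can help to turn the status list into a lookup that tells the on-call person what to do next. The sketch below only restates the list above. The `Status` enum and `next_step` helper are hypothetical and are not part of Knocker.

```python
from enum import Enum


class Status(str, Enum):
    RECEIVED = "received"
    PROCESSING = "processing"
    FAILED = "failed"
    DEAD = "dead"
    IGNORED = "ignored"
    HANDLED = "handled"


# Operator hint per status, paraphrasing the status list in this runbook.
NEXT_STEP = {
    Status.RECEIVED: "wait: stored and waiting for a worker",
    Status.PROCESSING: "wait: claimed by a worker or inside a visibility window",
    Status.FAILED: "watch: handler failed, may be retried or requeued",
    Status.DEAD: "act: retries exhausted, explicit operator recovery is needed",
    Status.IGNORED: "none: intentionally skipped by operator or policy",
    Status.HANDLED: "none: handler finished and the queue ack committed",
}


def next_step(status: str) -> str:
    """Map a raw status string to an operator hint; unknown values raise ValueError."""
    return NEXT_STEP[Status(status)]


print(next_step("dead"))
```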

Dead events do not automatically reactivate when the provider retries. Duplicate ingress is non-mutating: Knocker stores another Delivery, keeps the existing Event dead, and does not enqueue work.

Pick the recovery you actually mean. Use requeue(...) when you want to process the canonical event payload again.

Use replay_delivery(...) when you want to process one specific stored delivery body, for example a provider redelivery whose payload differs from the canonical event body.

Both paths reset attempt_count to 0, so the dead-letter clock starts over.
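
The difference between the two recovery paths is easiest to see in a toy model. The sketch below is not Knocker's implementation, and the real call signatures are binding-specific. `ToyEvent`, its fields other than `attempt_count`, and both function signatures are hypothetical. It models only two facts from the text: which body gets processed, and that `attempt_count` returns to 0.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ToyEvent:
    canonical_body: bytes
    attempt_count: int = 0
    # delivery id -> stored body; a redelivery may differ from the canonical body
    deliveries: Dict[int, bytes] = field(default_factory=dict)
    queued_body: Optional[bytes] = None  # what the next worker run would see


def requeue(event: ToyEvent) -> None:
    """Process the canonical event payload again."""
    event.queued_body = event.canonical_body
    event.attempt_count = 0  # dead-letter clock starts over


def replay_delivery(event: ToyEvent, delivery_id: int) -> None:
    """Process one specific stored delivery body."""
    event.queued_body = event.deliveries[delivery_id]
    event.attempt_count = 0  # dead-letter clock starts over


event = ToyEvent(b'{"v": 1}', attempt_count=5, deliveries={1: b'{"v": 1}', 2: b'{"v": 2}'})
replay_delivery(event, 2)
print(event.queued_body, event.attempt_count)
```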

Ignore events that are still safe to skip. ignore(...) is accepted from received, failed, and dead. It is a no-op when the event is already ignored, and it is rejected for events that are processing or handled.

If a queued job for the event is claimed later, the worker short-circuits it without invoking the handler.
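
The ignore rules form a small transition table. As a sketch of those rules, under the assumption that statuses are plain strings, the hypothetical helpers below (`ignore_outcome`, `should_invoke_handler`) encode which source statuses are accepted, which are a no-op, and which are rejected.

```python
ACCEPTED_FROM = {"received", "failed", "dead"}
REJECTED_FROM = {"processing", "handled"}


def ignore_outcome(status: str) -> str:
    """Return 'ignored', 'noop', or 'rejected' for an ignore request."""
    if status == "ignored":
        return "noop"
    if status in ACCEPTED_FROM:
        return "ignored"
    if status in REJECTED_FROM:
        return "rejected"
    raise ValueError(f"unknown status: {status!r}")


def should_invoke_handler(status_at_claim: str) -> bool:
    """A job claimed after its event was ignored is short-circuited."""
    return status_at_claim != "ignored"


for s in ("received", "failed", "dead", "ignored", "processing", "handled"):
    print(s, "->", ignore_outcome(s))
```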

If run_worker(...) raises outside normal handler retry/dead-letter handling, Knocker records local worker state and calls on_error before re-raising. Inspect worker state through your binding’s worker-state helper where available.

Knocker does not supervise your process. Wrap run_worker(...) in your app’s normal task supervision or process manager, and restart it the same way you restart the rest of your application.
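
If you do not already have a process manager restarting the worker, a restart loop with capped exponential backoff is one common shape. This is a generic sketch, not something Knocker ships. `supervise` and its parameters are hypothetical, and `run` stands for a zero-argument callable that wraps your binding's run_worker(...) call, whose arguments are not shown here.

```python
import logging
import time
from typing import Callable

log = logging.getLogger("worker-supervisor")


def supervise(
    run: Callable[[], None],
    max_restarts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call run() until it returns cleanly; return how many restarts were needed.

    Re-raises the last exception once max_restarts is exhausted, so an outer
    process manager still sees the failure.
    """
    restarts = 0
    while True:
        try:
            run()
            return restarts
        except Exception:
            if restarts >= max_restarts:
                raise
            delay = min(max_delay, base_delay * 2 ** restarts)
            log.exception("worker crashed; restarting in %.1fs", delay)
            restarts += 1
            sleep(delay)
```

Injecting `sleep` keeps the loop testable without real waiting. In production you would normally leave the default and let systemd, a container runtime, or your task supervisor handle the final re-raise.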

What restart confidence Knocker actually tests

Knocker does not claim special crash semantics beyond the SQLite transactions it uses.

What is explicitly covered in the repo today:

  • a real subprocess-kill test where ingress work is interrupted before commit; the fresh reopen sees no phantom Event, Delivery, or queued job
  • committed ingress survives a fresh-process reopen and is still processable
  • expired claims can be reclaimed after reopen and drained normally

So the mental model is intentionally boring:

  • if the relevant SQLite transaction committed, Knocker state committed
  • if it rolled back or never committed, Knocker state did not durably change
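
The same commit-or-nothing behavior can be observed with Python's standard sqlite3 module alone. The sketch below does not use Knocker and does not reproduce its subprocess-kill test. It simulates the interruption with an in-process exception, and the `receipts` table is a made-up example schema.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")

conn = sqlite3.connect(path)
conn.execute("CREATE TABLE receipts (id INTEGER PRIMARY KEY, body TEXT)")
conn.commit()

# Committed write: the context manager commits when the block exits cleanly.
with conn:
    conn.execute("INSERT INTO receipts (body) VALUES ('committed')")

# Interrupted write: an exception inside the block rolls the transaction back.
try:
    with conn:
        conn.execute("INSERT INTO receipts (body) VALUES ('interrupted')")
        raise RuntimeError("simulated failure before commit")
except RuntimeError:
    pass
conn.close()

# Fresh reopen: only the committed row is visible.
fresh = sqlite3.connect(path)
rows = [row[0] for row in fresh.execute("SELECT body FROM receipts ORDER BY id")]
fresh.close()
print(rows)
```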

Representative local benchmark command:

uv run --group dev python bench/knocker_bench.py --ingest-n 5000 --worker-n 5000

Representative local numbers on an Apple M1 Pro, Python 3.13.5, SQLite 3.49.1:

  • durable ingress-only: 5,000 events in 1.445s (3,460/s, 0.289 ms/event)
  • no-op-handler worker drain: 5,000 events in 1.836s (2,723/s, 0.367 ms/event)
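
The per-second and per-event figures follow directly from the event count and wall time, which is useful when you rerun the benchmark on your own hardware. A small helper for that arithmetic (`rates` is a hypothetical name, not part of the benchmark script):

```python
from typing import Tuple


def rates(n_events: int, seconds: float) -> Tuple[float, float]:
    """Return (events per second, milliseconds per event)."""
    return n_events / seconds, seconds / n_events * 1000.0


for label, n, secs in (("ingress", 5000, 1.445), ("worker drain", 5000, 1.836)):
    per_sec, ms_each = rates(n, secs)
    print(f"{label}: {per_sec:,.0f}/s, {ms_each:.3f} ms/event")
```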

These numbers assume the ordinary hot path: one long-lived knocker.open(...) per process. Ingress and workers may run in separate processes against the same SQLite file, but many independent handles inside one process are a degraded contention mode.

Pruning is explicit and irreversible. Preview candidates with read APIs first, then prune with a bounded threshold. Prune old orphan deliveries separately. Every prune call writes a durable audit row. Use the Retention and pruning page for binding-specific calls.

Audit rows include what was deleted, when, and which filters were used. Even a no-op prune (zero rows deleted) records an audit row so the operator timeline is unambiguous.
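
The preview-then-prune pattern with an always-written audit row can be sketched with plain sqlite3. This is not Knocker's schema or API. The `toy_events` and `toy_prune_audit` tables, the `preview` and `prune` functions, and the age-only filter are all assumptions made for illustration.

```python
import json
import sqlite3
import time
from typing import List

CANDIDATES = "SELECT id FROM toy_events WHERE created_at < ? ORDER BY id"


def preview(conn: sqlite3.Connection, older_than: float) -> List[int]:
    """Read-only: list the ids a prune with this threshold would delete."""
    return [row[0] for row in conn.execute(CANDIDATES, (older_than,))]


def prune(conn: sqlite3.Connection, older_than: float) -> int:
    """Delete candidates and record an audit row in the same transaction."""
    with conn:
        ids = preview(conn, older_than)
        conn.executemany("DELETE FROM toy_events WHERE id = ?", [(i,) for i in ids])
        # Written even when ids is empty, so a no-op prune is still on record.
        conn.execute(
            "INSERT INTO toy_prune_audit (pruned_at, filters, deleted_ids) VALUES (?, ?, ?)",
            (time.time(), json.dumps({"older_than": older_than}), json.dumps(ids)),
        )
    return len(ids)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_events (id INTEGER PRIMARY KEY, created_at REAL)")
conn.execute(
    "CREATE TABLE toy_prune_audit "
    "(id INTEGER PRIMARY KEY, pruned_at REAL, filters TEXT, deleted_ids TEXT)"
)
conn.executemany(
    "INSERT INTO toy_events (id, created_at) VALUES (?, ?)",
    [(1, 100.0), (2, 200.0), (3, 300.0)],
)
conn.commit()
```

Calling `preview(conn, 250.0)` lists the two older rows without changing anything. Calling `prune(conn, 250.0)` twice deletes them once and leaves two audit rows, the second recording an empty deletion.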

Reach for deliveries when the question is about what arrived:

  • Was the signature valid?
  • Did the provider retry?
  • Did the redelivery body differ from the canonical event body?
  • Was a receipt stored but not linked to an event?

Reach for events when the question is about processing:

  • Did this handler run?
  • Is the event dead or failed?
  • Should the operator requeue, replay, ignore, or prune it?