Operator runbook
Knocker is intentionally embedded. There is no built-in admin UI or operator HTTP API, so production recovery starts from the binding surface or admin route you own.
This page is the “3 AM” checklist.
First checks
Start by listing recent activity for the endpoint you care about: recent events, invalid deliveries, and orphan deliveries. Use the Operator surface page for binding-specific calls.
Use Event rows to understand processing state. Use Delivery rows to understand what HTTP receipts arrived.
Provider says it sent the webhook, but my handler did not run
Check for an orphan delivery first.
If a delivery is orphaned with signature_valid=False, Knocker stored the receipt but did not create or enqueue an event because verification failed.
If the delivery is linked to an event, inspect that event with get_event(...).
Common statuses:
- received: stored and waiting for a worker.
- processing: currently claimed by a worker or still inside a visibility window.
- failed: handler failed but may be retried or requeued.
- dead: retries exhausted; explicit operator recovery is needed.
- ignored: operator or application policy intentionally skipped it.
- handled: handler finished and the queue ack committed.
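As a quick triage aid, the status list above can be written as a lookup table. This is only a sketch: the status names come from this page, while the `next_step` helper and its advice strings are hypothetical paraphrases, not something Knocker returns.

```python
# Status names are taken from this page. The advice strings are an
# operator-side paraphrase (hypothetical), not output from Knocker.
NEXT_STEP = {
    "received": "wait: stored and waiting for a worker",
    "processing": "wait: claimed by a worker or inside a visibility window",
    "failed": "watch: handler failed but may be retried or requeued",
    "dead": "act: retries exhausted, explicit operator recovery is needed",
    "ignored": "none: intentionally skipped by operator or application policy",
    "handled": "none: handler finished and the queue ack committed",
}


def next_step(status: str) -> str:
    """Return the suggested operator reaction for an event status."""
    try:
        return NEXT_STEP[status]
    except KeyError:
        # An unknown status is surfaced loudly instead of being guessed at.
        raise ValueError(f"unknown event status: {status!r}") from None
```

Only dead asks for immediate operator action; everything else is either in flight or already settled.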
Event is dead
Dead events do not automatically reactivate when the provider retries. Duplicate ingress is non-mutating: Knocker stores another Delivery, keeps the existing Event dead, and does not enqueue work.
Pick the recovery you actually mean. Use requeue(...) when you want to
process the canonical event payload again.
Use replay_delivery(...) when you want to process one specific stored delivery body, for example a provider redelivery whose payload differs from the canonical event body.
Both paths reset attempt_count to 0, so the dead-letter clock starts over.
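The choice between the two recovery paths can be sketched as a small planning function. It is an illustration only: `DeadEvent`, `plan_recovery`, and their fields are hypothetical stand-ins, and the code does not call Knocker. It assumes the deciding factor is whether a specific stored delivery body differs from the canonical event body, which is the example given above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeadEvent:
    # Hypothetical stand-in for an event row, not Knocker's real type.
    canonical_body: bytes
    attempt_count: int
    status: str = "dead"


def plan_recovery(event: DeadEvent, delivery_body: Optional[bytes] = None) -> dict:
    """Describe which recovery call fits and what it would process.

    delivery_body is the stored body of one specific delivery, or None
    when the operator only wants the canonical payload processed again.
    """
    differs = delivery_body is not None and delivery_body != event.canonical_body
    return {
        "call": "replay_delivery" if differs else "requeue",
        "body": delivery_body if differs else event.canonical_body,
        # Per this page, both paths reset attempt_count to 0.
        "attempt_count_after": 0,
    }
```

For example, an event that died after 5 attempts with body `b'{"v": 1}'` plans a requeue when no delivery body is given, and a replay_delivery when a redelivery carried `b'{"v": 2}'`.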
Event should not run
Ignore events that are still safe to skip. ignore(...) is accepted from
received, failed, and dead. It is a no-op when the event is already
ignored, and it rejects processing and handled.
If a queued job for the event is claimed later, the worker short-circuits it without invoking the handler.
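The ignore(...) rules above amount to a small transition table. The following sketch encodes them for reference; `ignore_outcome` is a hypothetical helper that mirrors the prose and does not call Knocker.

```python
# Status sets are taken from this page.
ACCEPTED_FROM = {"received", "failed", "dead"}
REJECTED_FROM = {"processing", "handled"}


def ignore_outcome(status: str) -> str:
    """Return what ignore(...) is documented to do for a given status."""
    if status == "ignored":
        return "no-op"      # already ignored, nothing changes
    if status in ACCEPTED_FROM:
        return "accepted"   # the event moves to ignored
    if status in REJECTED_FROM:
        return "rejected"   # in flight or already handled
    raise ValueError(f"unknown event status: {status!r}")
```

Checking the outcome before calling ignore(...) from an admin route keeps a rejected request from looking like a silent success.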
Worker died
If run_worker(...) raises outside normal handler retry/dead-letter
handling, Knocker records local worker state and calls on_error before
re-raising. Inspect worker state through your binding’s worker-state helper
where available.
Knocker does not supervise your process. Wrap run_worker(...) in your app’s normal task supervision or process manager, and restart it the same way you restart the rest of your application.
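If you have no process manager handy, an in-process restart loop is one possible shape for that supervision. The sketch below is an assumption-laden example, not part of Knocker: `supervise` and all of its parameters are hypothetical, and the `run_worker` argument is a zero-argument callable that you would write to wrap your binding's run_worker(...) call.

```python
import time
from typing import Callable


def supervise(
    run_worker: Callable[[], None],
    max_restarts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run the worker and restart it after a crash, with capped backoff.

    Returns the number of restarts once the worker exits cleanly, and
    re-raises the last error once max_restarts is used up.
    """
    restarts = 0
    while True:
        try:
            run_worker()
            return restarts
        except Exception:
            # KeyboardInterrupt and SystemExit are not caught here, so a
            # deliberate shutdown still stops the loop.
            if restarts >= max_restarts:
                raise
            # Exponential backoff: base, 2x base, 4x base, capped at max_delay.
            sleep(min(max_delay, base_delay * (2 ** restarts)))
            restarts += 1
```

The restart cap is deliberate: a worker that crashes on every start should surface to whatever supervises the whole application rather than loop forever.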
What restart confidence Knocker actually tests
Knocker does not claim special crash semantics beyond the SQLite transactions it uses.
What is explicitly covered in the repo today:
- a real subprocess-kill test where ingress work is interrupted before commit; the fresh reopen sees no phantom Event, Delivery, or queued job
- committed ingress survives a fresh-process reopen and is still processable
- expired claims can be reclaimed after reopen and drained normally
So the mental model is intentionally boring:
- if the relevant SQLite transaction committed, Knocker state committed
- if it rolled back or never committed, Knocker state did not durably change
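The same model can be seen with nothing but Python's standard sqlite3 module. This sketch does not use Knocker or its schema: the `event` table and the ids are made up, and the "crash" is an in-process exception that triggers a rollback, which is a much weaker simulation than the subprocess-kill test mentioned above.

```python
import os
import sqlite3
import tempfile


def demo_commit_vs_rollback(db_path: str) -> list:
    """Write one committed and one rolled-back row, then reopen and read."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS event (id TEXT PRIMARY KEY)")
    conn.commit()

    # `with conn:` commits when the block succeeds...
    with conn:
        conn.execute("INSERT INTO event (id) VALUES (?)", ("evt_committed",))

    # ...and rolls back when the block raises before reaching the commit.
    try:
        with conn:
            conn.execute("INSERT INTO event (id) VALUES (?)", ("evt_interrupted",))
            raise RuntimeError("simulated failure before commit")
    except RuntimeError:
        pass
    conn.close()

    # A fresh connection stands in for a fresh-process reopen.
    reopened = sqlite3.connect(db_path)
    try:
        rows = reopened.execute("SELECT id FROM event ORDER BY id").fetchall()
    finally:
        reopened.close()
    return [row[0] for row in rows]


with tempfile.TemporaryDirectory() as tmp:
    surviving = demo_commit_vs_rollback(os.path.join(tmp, "demo.sqlite3"))
```

After the reopen, `surviving` holds only the committed id; the interrupted insert left no trace.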
Representative local benchmark command:
    uv run --group dev python bench/knocker_bench.py --ingest-n 5000 --worker-n 5000
Representative local numbers on an Apple M1 Pro, Python 3.13.5, SQLite 3.49.1:
- durable ingress-only: 5,000 events in 1.445s (3,460/s, 0.289 ms/event)
- no-op-handler worker drain: 5,000 events in 1.836s (2,723/s, 0.367 ms/event)
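The throughput and per-event figures follow directly from the event count and the elapsed time. The helper below is a hypothetical convenience, not part of the benchmark script; it simply redoes the arithmetic so you can compare your own runs on the same terms.

```python
def summarize(events: int, seconds: float) -> tuple:
    """Return (events per second, milliseconds per event)."""
    return events / seconds, seconds / events * 1000.0


# Inputs are the representative numbers quoted on this page.
ingress_rate, ingress_ms = summarize(5000, 1.445)
worker_rate, worker_ms = summarize(5000, 1.836)
```

Rounded, these reproduce the quoted values: about 3,460 events/s at 0.289 ms/event for ingress, and about 2,723 events/s at 0.367 ms/event for the worker drain.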
These numbers assume the ordinary hot path: one long-lived
knocker.open(...) per process. Ingress and workers may run in separate
processes against the same SQLite file, but many independent handles inside
one process are a degraded contention mode.
Before pruning
Pruning is explicit and irreversible. Preview candidates with read APIs first, then prune with a bounded threshold. Prune old orphan deliveries separately. Every prune call writes a durable audit row. Use the Retention and pruning page for binding-specific calls.
Audit rows include what was deleted, when, and which filters were used. Even a no-op prune (zero rows deleted) records an audit row so the operator timeline is unambiguous.
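The preview-then-prune workflow, including the audit row for a no-op prune, can be modeled in memory as follows. Everything here is hypothetical: `ToyStore`, its methods, and the audit-row fields are invented for illustration and are not Knocker's pruning API or schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class ToyStore:
    # Hypothetical in-memory stand-in: row id -> creation time.
    rows: dict
    audit: list = field(default_factory=list)

    def preview(self, older_than: datetime) -> list:
        """Read-only: list the ids a prune with this threshold would delete."""
        return sorted(i for i, ts in self.rows.items() if ts < older_than)

    def prune(self, older_than: datetime, now: datetime) -> int:
        """Delete rows older than the threshold and always record an audit row."""
        doomed = self.preview(older_than)
        for row_id in doomed:
            del self.rows[row_id]
        # Recorded even when nothing was deleted, so the timeline has no gaps.
        self.audit.append(
            {"deleted": doomed, "at": now, "filters": {"older_than": older_than}}
        )
        return len(doomed)


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
store = ToyStore(rows={"old": now - timedelta(days=90), "new": now - timedelta(days=1)})
cutoff = now - timedelta(days=30)   # the bounded threshold
candidates = store.preview(cutoff)  # look before deleting
deleted_first = store.prune(cutoff, now)
deleted_second = store.prune(cutoff, now)  # no-op, still audited
```

Running the same prune twice leaves two audit rows: one listing the deleted row and one listing nothing.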
When to inspect deliveries
Reach for deliveries when the question is about what arrived:
- Was the signature valid?
- Did the provider retry?
- Did the redelivery body differ from the canonical event body?
- Was a receipt stored but not linked to an event?
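The four delivery questions above can be answered in one pass over the rows your binding returns. This is a sketch under assumptions: `DeliveryRow` and its `body` and `event_id` fields are hypothetical (only signature_valid is named on this page), and an `event_id` of None is used here to model an orphan.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class DeliveryRow:
    # Hypothetical row shape; only signature_valid is named on this page.
    signature_valid: bool
    body: bytes
    event_id: Optional[str]  # None models an orphan delivery


def summarize_deliveries(rows: Sequence[DeliveryRow], canonical_body: bytes) -> dict:
    """Answer the delivery questions for the receipts of one event."""
    linked = [r for r in rows if r.event_id is not None]
    return {
        "invalid_signatures": sum(1 for r in rows if not r.signature_valid),
        # More than one linked receipt suggests the provider retried.
        "provider_retried": len(linked) > 1,
        "redelivery_body_differs": any(r.body != canonical_body for r in linked),
        "orphans": sum(1 for r in rows if r.event_id is None),
    }
```

A non-zero orphan count with invalid signatures points back to the verification check at the top of this page.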
Reach for events when the question is about processing:
- Did this handler run?
- Is the event dead or failed?
- Should the operator requeue, replay, ignore, or prune it?