Operator runbook

Knocker is intentionally embedded. There is no built-in admin UI or operator HTTP API, so production recovery starts from the binding surface or admin route you own.

This page is the “3 AM” checklist.

First checks

Start by listing what Knocker has recorded for the endpoint you care about: recent events, invalid deliveries, and orphan deliveries. Use the Operator surface page for binding-specific calls.

Use Event rows to understand processing state. Use Delivery rows to understand what HTTP receipts arrived.

Provider says it sent the webhook, but my handler did not run

Check for an orphan delivery first, meaning a receipt that was stored but not linked to an event.

If a delivery is orphaned with signature_valid=False, Knocker stored the receipt but did not create or enqueue an event because verification failed.

If the delivery is linked to an event, inspect that event with get_event(...).

Common statuses:

received: stored and waiting for a worker.
processing: currently claimed by a worker or still inside a visibility window.
failed: handler failed but may be retried or requeued.
dead: retries exhausted; explicit operator recovery is needed.
ignored: operator or application policy intentionally skipped it.
handled: handler finished and the queue ack committed.
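
The status list above maps fairly directly onto a first-response decision. The following is a minimal sketch of that mapping, not part of Knocker: the status strings are taken from the list, while next_step and its advice strings are hypothetical names chosen for illustration.

```python
# Hypothetical triage helper; NOT part of Knocker. The status names come
# from the list above, the advice strings are illustrative only.
NEXT_STEP = {
    "received": "wait: stored and waiting for a worker; check a worker is running",
    "processing": "wait: claimed by a worker or inside a visibility window",
    "failed": "watch: may be retried; requeue if it does not recover",
    "dead": "recover: retries exhausted; choose requeue, replay_delivery, or ignore",
    "ignored": "none: intentionally skipped",
    "handled": "none: handler finished and the queue ack committed",
}


def next_step(status: str) -> str:
    """Return a first-response hint for an event status string."""
    try:
        return NEXT_STEP[status]
    except KeyError:
        # Fail loudly on a status this table does not know about.
        raise ValueError(f"unknown event status: {status!r}") from None


if __name__ == "__main__":
    for status in NEXT_STEP:
        print(f"{status:<10} -> {next_step(status)}")
```

Raising on an unknown status is a deliberate choice here: a triage helper that silently returns nothing is easy to misread at 3 AM.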

Event is dead

Dead events do not automatically reactivate when the provider retries. Duplicate ingress is non-mutating: Knocker stores another Delivery, keeps the existing Event dead, and does not enqueue work.

Pick the recovery you actually mean. Use requeue(...) when you want to process the canonical event payload again.

Use replay_delivery(...) when you want to process one specific stored delivery body, for example a provider redelivery whose payload differs from the canonical event body.

Both paths reset attempt_count to 0, so the dead-letter clock starts over.
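
One way to frame the choice between the two recovery paths is to compare the canonical event body with the stored delivery body you are looking at. The sketch below assumes you have already fetched both bodies as bytes through your binding. choose_recovery is a hypothetical helper, not a Knocker API, and it only names a path; the operator still decides which payload should run.

```python
from typing import Optional


def choose_recovery(canonical_body: bytes, delivery_body: Optional[bytes]) -> str:
    """Suggest a recovery path name for a dead event.

    Hypothetical helper, not part of Knocker. Returns "replay_delivery" only
    when a specific stored delivery body is given and it differs from the
    canonical event body; otherwise returns "requeue".
    """
    if delivery_body is not None and delivery_body != canonical_body:
        return "replay_delivery"
    return "requeue"


if __name__ == "__main__":
    canonical = b'{"id": "evt_1", "amount": 100}'
    redelivery = b'{"id": "evt_1", "amount": 120}'
    print(choose_recovery(canonical, None))        # no specific delivery in mind
    print(choose_recovery(canonical, canonical))   # identical redelivery
    print(choose_recovery(canonical, redelivery))  # payload differs
```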

Event should not run

Ignore events that are still safe to skip. ignore(...) is accepted from received, failed, and dead. It is a no-op when the event is already ignored, and it is rejected when the event is processing or handled.

If a queued job for the event is claimed later, the worker short-circuits it without invoking the handler.
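
The acceptance rules for ignore(...) described above can be modeled as a small transition function. This is a sketch of the documented behavior only. apply_ignore and worker_skips are hypothetical names, and how Knocker actually reports a rejection (exception type, return value) is an assumption here.

```python
# Model of the documented ignore(...) rules; NOT Knocker's implementation.
IGNORABLE = frozenset({"received", "failed", "dead"})
REJECTED = frozenset({"processing", "handled"})


def apply_ignore(status: str) -> str:
    """Return the status an event would have after an ignore request."""
    if status == "ignored":
        return status  # already ignored: no-op
    if status in IGNORABLE:
        return "ignored"
    if status in REJECTED:
        # Assumption: the rejection is modeled as ValueError for illustration.
        raise ValueError(f"cannot ignore an event that is {status}")
    raise ValueError(f"unknown event status: {status!r}")


def worker_skips(status: str) -> bool:
    """True when a worker claiming a queued job should not invoke the handler."""
    return status == "ignored"


if __name__ == "__main__":
    for s in ("received", "failed", "dead", "ignored"):
        print(s, "->", apply_ignore(s))
```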

Worker died

If run_worker(...) raises outside normal handler retry/dead-letter handling, Knocker records local worker state and calls on_error before re-raising. Inspect worker state through your binding’s worker-state helper where available.

Knocker does not supervise your process. Wrap run_worker(...) in your app’s normal task supervision or process manager, and restart it the same way you restart the rest of your application.
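
If your application has no supervision layer yet, a restart loop with capped exponential backoff is one common shape. The sketch below is generic Python rather than a Knocker feature. supervise and its parameters are hypothetical, and run_once stands in for a zero-argument callable that wraps run_worker(...) with whatever arguments your binding requires.

```python
import time
from typing import Callable


def supervise(
    run_once: Callable[[], None],
    max_restarts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run run_once, restarting it with capped exponential backoff on error.

    Returns the number of restarts performed once run_once exits cleanly.
    Re-raises the last error after max_restarts restarts have been used.
    """
    restarts = 0
    while True:
        try:
            run_once()
            return restarts  # clean exit: stop supervising
        except Exception:
            if restarts >= max_restarts:
                raise  # give up and surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** restarts)
            sleep(delay)
            restarts += 1


if __name__ == "__main__":
    attempts = {"n": 0}

    def flaky_worker() -> None:
        # Stand-in for a wrapped run_worker(...) call: fails twice, then exits.
        attempts["n"] += 1
        if attempts["n"] <= 2:
            raise RuntimeError("worker crashed")

    print("restarts:", supervise(flaky_worker, sleep=lambda s: None))
```

The sleep parameter is injectable so the loop can be tested without waiting. A process manager or your framework's task supervision is usually a better home for this logic than a hand-rolled loop.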

What restart confidence Knocker actually tests

Knocker does not claim special crash semantics beyond the SQLite transactions it uses.

What is explicitly covered in the repo today:

a real subprocess-kill test where ingress work is interrupted before commit; the fresh reopen sees no phantom Event, Delivery, or queued job
committed ingress survives a fresh-process reopen and is still processable
expired claims can be reclaimed after reopen and drained normally

So the mental model is intentionally boring:

if the relevant SQLite transaction committed, Knocker state committed
if it rolled back or never committed, Knocker state did not durably change
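
The mental model above can be seen with Python's standard sqlite3 module. In this sketch, an explicit rollback stands in for a crash before commit (the repo's own test kills a real subprocess instead), and the event table is a hypothetical single-column schema, not Knocker's.

```python
import os
import sqlite3
import tempfile


def demo_commit_vs_rollback(path: str) -> list:
    """Write one committed row and one rolled-back row, then reopen and read."""
    conn = sqlite3.connect(path)
    # Hypothetical schema for illustration; not Knocker's tables.
    conn.execute("CREATE TABLE IF NOT EXISTS event (id TEXT PRIMARY KEY)")
    conn.commit()

    conn.execute("INSERT INTO event (id) VALUES ('committed')")
    conn.commit()  # this row is durable from here on

    conn.execute("INSERT INTO event (id) VALUES ('interrupted')")
    conn.rollback()  # stands in for a crash before commit
    conn.close()

    # A fresh connection plays the role of the restarted process.
    fresh = sqlite3.connect(path)
    try:
        rows = fresh.execute("SELECT id FROM event ORDER BY id").fetchall()
    finally:
        fresh.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(demo_commit_vs_rollback(os.path.join(tmp, "demo.db")))
```

Only the committed row is visible after the reopen, which is the whole guarantee: no phantom state from work that never committed.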

Representative local benchmark command:

uv run --group dev python bench/knocker_bench.py --ingest-n 5000 --worker-n 5000

Representative local numbers on an Apple M1 Pro, Python 3.13.5, SQLite 3.49.1:

durable ingress-only: 5,000 events in 1.445s (3,460/s, 0.289 ms/event)
no-op-handler worker drain: 5,000 events in 1.836s (2,723/s, 0.367 ms/event)
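
The per-second and per-event figures above are derived from the event count and wall-clock time. A small sketch of that arithmetic, useful when comparing against your own runs (throughput is a hypothetical helper, not part of the benchmark script):

```python
def throughput(events: int, seconds: float) -> tuple:
    """Return (events per second, milliseconds per event)."""
    return events / seconds, seconds / events * 1000.0


if __name__ == "__main__":
    for label, seconds in (("ingress-only", 1.445), ("worker drain", 1.836)):
        rate, ms = throughput(5000, seconds)
        print(f"{label}: {rate:,.0f}/s, {ms:.3f} ms/event")
```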

These numbers assume the ordinary hot path: one long-lived knocker.open(...) per process. Ingress and workers may run in separate processes against the same SQLite file, but opening many independent handles inside one process is a degraded mode with extra contention.

Before pruning

Pruning is explicit and irreversible. Preview candidates with read APIs first, then prune with a bounded threshold. Prune old orphan deliveries separately. Every prune call writes a durable audit row. Use the Retention and pruning page for binding-specific calls.

Audit rows include what was deleted, when, and which filters were used. Even a no-op prune (zero rows deleted) records an audit row so the operator timeline is unambiguous.
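
The preview-then-prune pattern, including the audit row on a no-op, can be sketched with plain sqlite3. Everything named here is an assumption for illustration: the delivery and prune_audit tables, their columns, and the functions make_demo_db, preview_before, and prune_before are hypothetical and do not reflect Knocker's schema or API.

```python
import json
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical schema and 3 rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE delivery (id INTEGER PRIMARY KEY, received_at REAL)")
    conn.execute(
        "CREATE TABLE prune_audit ("
        "id INTEGER PRIMARY KEY, pruned_at REAL, filters TEXT, deleted INTEGER)"
    )
    conn.executemany(
        "INSERT INTO delivery (id, received_at) VALUES (?, ?)",
        [(1, 10.0), (2, 20.0), (3, 30.0)],
    )
    conn.commit()
    return conn


def preview_before(conn: sqlite3.Connection, cutoff: float) -> list:
    """Read-only preview: ids that a prune with this cutoff would delete."""
    rows = conn.execute(
        "SELECT id FROM delivery WHERE received_at < ? ORDER BY id", (cutoff,)
    )
    return [row[0] for row in rows]


def prune_before(conn: sqlite3.Connection, cutoff: float, now: float) -> int:
    """Delete rows older than cutoff and record an audit row atomically."""
    with conn:  # commits on success, rolls back if anything raises
        cur = conn.execute("DELETE FROM delivery WHERE received_at < ?", (cutoff,))
        deleted = cur.rowcount
        # The audit row is written even when deleted == 0.
        conn.execute(
            "INSERT INTO prune_audit (pruned_at, filters, deleted) VALUES (?, ?, ?)",
            (now, json.dumps({"received_before": cutoff}), deleted),
        )
    return deleted


if __name__ == "__main__":
    db = make_demo_db()
    print("would delete:", preview_before(db, 25.0))
    print("deleted:", prune_before(db, 25.0, now=100.0))
    print("deleted on repeat:", prune_before(db, 25.0, now=101.0))
```

Writing the delete and the audit row in one transaction means the audit timeline cannot disagree with what was actually removed.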

When to inspect deliveries

Reach for deliveries when the question is about what arrived:

Was the signature valid?
Did the provider retry?
Did the redelivery body differ from the canonical event body?
Was a receipt stored but not linked to an event?

Reach for events when the question is about processing:

Did this handler run?
Is the event dead or failed?
Should the operator requeue, replay, ignore, or prune it?