# Operational maturity audit

**Scope:** Observability, recovery tooling, backup verification hooks, alerting, and runbooks, as reflected **in the repo** and in governance docs.

---

## Executive summary

The codebase has **meaningful operational scaffolding**: scheduled tasks in `bootstrap/app.php`, ops-oriented Artisan commands in `routes/console.php`, **System health** and **Integrations health** admin pages, webhook **dead-letter** tooling, and governance documents (`docs/governance/*`) that function as **runbook-level policy**.

**Gaps:** `docs/` does not yet contain copy-paste **executable runbooks** per subsystem (payroll, HR). Policy exists, but **service-specific** steps still rely on engineering memory.

---

## Observability coverage

| Signal | Present? | Notes |
|--------|-----------|------|
| System health dashboard | Yes | `SystemHealthPageController` + services |
| Integrations health | Yes | Dead letters, retry storms, API security summary |
| API audit logs | Yes | `api_audit_logs` + middleware |
| Security events | Yes | `api_security_events` for selected API failures |
| Webhook delivery rows | Yes | Status / attempts / timestamps |

**Recommendation:** Add external **SLO dashboards** fed either by SQL queries over these tables or by exported metrics. This is outside code scope, but operational maturity depends on it.
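
As an illustration of the "exported metrics" option, here is a minimal sketch that turns webhook delivery rows into a success ratio and renders it in the Prometheus text exposition format. The `status` values (`delivered`, `dead_letter`) and the metric names are assumptions made for the example; the real delivery schema and naming may differ.

```python
from collections import Counter

# Hypothetical status value; the real webhook delivery schema may differ.
SUCCESS_STATUS = "delivered"


def delivery_metrics(rows):
    """Summarize webhook delivery rows into per-status counts and a success ratio."""
    counts = Counter(row["status"] for row in rows)
    total = sum(counts.values())
    # With no deliveries in the window, report 1.0 so an idle system does not breach the SLO.
    ratio = counts[SUCCESS_STATUS] / total if total else 1.0
    return {"counts": dict(counts), "total": total, "success_ratio": ratio}


def to_prometheus_text(metrics):
    """Render the summary as Prometheus text exposition lines (metric names are illustrative)."""
    lines = [
        f'webhook_deliveries{{status="{status}"}} {count}'
        for status, count in sorted(metrics["counts"].items())
    ]
    lines.append(f'webhook_delivery_success_ratio {metrics["success_ratio"]:.4f}')
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        {"status": "delivered"},
        {"status": "delivered"},
        {"status": "delivered"},
        {"status": "dead_letter"},
    ]
    print(to_prometheus_text(delivery_metrics(sample)))
```

Treating an empty window as fully successful is a design choice; a dashboard that should flag "no traffic" would need a separate signal.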

---

## Recovery tooling

Documented / implemented commands include:

- `ops:webhooks:retry-failed`, `ops:events:replay`, `ops:webhooks:dead-letter-inspect`
- Tenancy, attendance, retention, and backup-recording commands (see `routes/console.php`)

**Gap:** There is no single “golden” **`ops:platform:status`** command that summarizes platform state. It would be a convenience only, not a requirement.

---

## Backup verification

Commands exist to **record** backup success/failure and verify restore markers (`ops:backup:*`).

**Maturity test:** Are these invoked from **real schedulers / CI**? The repo alone cannot prove runtime wiring.

**Recommendation:** Tie backup verification to a **monthly drill** ticket with evidence attached.
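
One way to produce evidence of runtime wiring is a freshness check that a scheduler or CI job could run against the last recorded backup success. The sketch below assumes the timestamp has already been read from wherever the `ops:backup:*` commands record it (that lookup is not shown), and the 26-hour default assumes daily backups with some slack; both are illustrative choices, not values taken from the repo.

```python
from datetime import datetime, timedelta, timezone


def backup_is_fresh(last_success_at, now, max_age=timedelta(hours=26)):
    """Return True if the last recorded backup success is within max_age of now."""
    # A missing record means no backup success was ever recorded: treat as stale.
    if last_success_at is None:
        return False
    return now - last_success_at <= max_age


if __name__ == "__main__":
    now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
    print(backup_is_fresh(now - timedelta(hours=3), now))  # recent record
    print(backup_is_fresh(now - timedelta(days=3), now))   # stale record
    print(backup_is_fresh(None, now))                      # never recorded
```

A failing check gives the monthly drill ticket a concrete artifact to attach: the timestamp found and the threshold it was compared against.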

---

## Alerting coverage

The code supports scheduling an **anomaly scan**; the alerting destination (email, PagerDuty) is **environment-specific**.

**Recommendation:** Document **who gets paged** for: payroll job failures, export failures, webhook DLQ spikes, backup failures.
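
The paging document can be kept honest with a small machine-checkable routing table. The following sketch assumes a hypothetical mapping from the four alert classes above to an on-call target and channel; the key names and team names are placeholders, not values from the repo.

```python
# Alert classes named in the recommendation; the identifiers are illustrative.
REQUIRED_ALERT_CLASSES = (
    "payroll_job_failure",
    "export_failure",
    "webhook_dlq_spike",
    "backup_failure",
)


def missing_owners(routing):
    """Return the required alert classes that have no primary on-call target."""
    return [
        alert
        for alert in REQUIRED_ALERT_CLASSES
        if not routing.get(alert, {}).get("primary")
    ]


if __name__ == "__main__":
    # Placeholder routing table with two deliberate gaps.
    routing = {
        "payroll_job_failure": {"primary": "payroll-oncall", "channel": "pagerduty"},
        "export_failure": {"primary": "platform-oncall", "channel": "email"},
        "webhook_dlq_spike": {"primary": "", "channel": "pagerduty"},
    }
    print(missing_owners(routing))
```

Running such a check in CI would fail the build whenever an alert class is added without an owner.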

---

## Runbook completeness

| Topic | Policy doc | Executable runbook in-repo |
|-------|------------|-----------------------------|
| Incidents | `docs/governance/incident-response.md` | Partial |
| Releases | `docs/governance/release-management.md` | Partial |
| Data retention | `docs/governance/data-governance.md` | Partial |

**Recommendation:** For each SEV1 class, add a **one-page tactical runbook** (`docs/runbooks/…`) with commands and SQL guardrails. This is future work.

---

## References

- `bootstrap/app.php` (`withSchedule`)
- `routes/console.php`
- `docs/governance/incident-response.md`
- `docs/governance/release-management.md`
- `docs/governance/data-governance.md`
