Raxx · internal docs

What the Current Status Dashboard Is Missing

Findings against the current /console/dashboard Infrastructure Status panel as of 2026-06-10. Data source notes reference status_poller.py and site_probes.py. An item marked "Data exists" means console_poll_log or surface_status_cache already carries the field; it is simply not surfaced.


1. Green-state hero signal

What's missing: When all 12 services are healthy, the dashboard looks nearly identical to when one is degraded: 12 colored dots with equal visual weight. There is no affirmative "all systems nominal" signal.

Impact: Operator must visually scan all 12 rows to confirm nothing is wrong. This is slow and fatiguing.

Fix direction: Add a single hero banner — "12 of 12 nominal" in moss with a clear timestamp — that collapses all rows and requires zero scanning.

Data: Already computed from the in-memory cache. Zero new data.
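
A minimal sketch of the banner text, assuming the in-memory cache can be read as a mapping of surface name to an ok flag. The dict shape and the name hero_summary are illustrative and not taken from status_poller.py.

```python
from datetime import datetime


def hero_summary(cache: dict[str, bool], now: datetime) -> str:
    """Build the one-line hero banner from per-surface ok flags."""
    total = len(cache)
    healthy = sum(1 for ok in cache.values() if ok)
    stamp = now.strftime("%H:%M UTC")
    if healthy == total:
        return f"{healthy} of {total} nominal as of {stamp}"
    # Name the degraded surfaces so the banner itself says where to look.
    degraded = sorted(name for name, ok in cache.items() if not ok)
    return f"{healthy} of {total} nominal as of {stamp}; degraded: {', '.join(degraded)}"
```

The all-green branch is the only one that should render in moss; the degraded branch names the surfaces so no row scan is needed in either state.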


2. Poll-history sparklines (latency over time)

What's missing: Each service shows one latency number (e.g., "803 ms" for raxx-mockups). You cannot tell whether 803 ms is an anomaly or the steady-state norm for that surface.

Impact: "Is this an incident or always slow?" requires memory, not data.

Fix direction: Render a 24h latency sparkline (p50 + p95) per service row. Each data point = one poll result from console_poll_log.

Data exists: console_poll_log records latency_ms per poll. Query: SELECT checked_at, latency_ms FROM console_poll_log WHERE surface_id = ? ORDER BY checked_at DESC LIMIT 288 (288 = 24h at 5-min intervals).
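
A sketch of how the query above could feed a sparkline. It assumes a DB-API connection that accepts the `?` placeholder style (sqlite3 is used here only because it does; the real driver may differ), and the function name sparkline_points is hypothetical.

```python
import sqlite3


def sparkline_points(conn: sqlite3.Connection, surface_id: str, limit: int = 288):
    """Return up to `limit` (checked_at, latency_ms) rows, oldest first."""
    rows = conn.execute(
        "SELECT checked_at, latency_ms FROM console_poll_log "
        "WHERE surface_id = ? ORDER BY checked_at DESC LIMIT ?",
        (surface_id, limit),
    ).fetchall()
    # The query returns newest first; a sparkline draws left to right.
    rows.reverse()
    return rows
```

Note that LIMIT 288 covers exactly 24h only when no polls were missed. If gaps are common, bounding on checked_at instead of row count would keep the time axis honest.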


3. Deploy markers on availability bars

What's missing: The 48h availability bars exist but carry no deploy information. When raxx-mockups went down at 14:18 UTC and a deploy happened at 11:24 UTC, the correlation is invisible.

Impact: Post-incident questions ("did the deploy cause it?") require cross-referencing two unrelated views.

Fix direction: Overlay vertical bronze diamond markers on the availability strip at the timestamp of each deploy within the window.

Data exists: DeployInfo.deployed_at is already in the cache per surface. Showing multiple deploys would require querying the deploy entries in console_poll_log, which are already stored.
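
Placing a marker reduces to mapping a deploy timestamp onto the strip's time axis. A minimal sketch, assuming the strip spans a fixed 48h window ending at a known instant; the function name marker_offset is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def marker_offset(
    deployed_at: datetime, window_end: datetime, window_hours: int = 48
) -> Optional[float]:
    """Return the marker position as a fraction in [0, 1], or None if outside the window."""
    window = timedelta(hours=window_hours)
    window_start = window_end - window
    if not (window_start <= deployed_at <= window_end):
        return None
    # Dividing two timedeltas yields a plain float.
    return (deployed_at - window_start) / window
```

The frontend would multiply the fraction by the strip width to position the diamond. Deploys older than the window return None and are simply not drawn.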


4. p95 vs p50 latency distinction

What's missing: One latency number per service. The raxx-mockups probe returned 803 ms, but the 24h p50 is ~187 ms and the p95 is ~614 ms. Read against the p50 alone, 803 ms looks like a ~4x spike; against the p95 it is only ~1.3x, a modest excursion on an already-slow surface.

Impact: Severity assessment is wrong without baseline context.

Fix direction: Show both p50 and p95 in the detail view, and plot both on the sparkline with different line weights (solid = p50, dashed = p95).

Data exists: console_poll_log.latency_ms — compute percentiles over a rolling 24h window.
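
A sketch of the percentile computation using the nearest-rank method, which is one reasonable choice and not necessarily what the dashboard should ship with. The input is assumed to be the latency_ms values already filtered to the rolling 24h window; the function names are illustrative.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples in the window")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms: list[float]) -> dict[str, float]:
    """Return the two baselines shown in the detail view."""
    return {"p50": percentile(latencies_ms, 50), "p95": percentile(latencies_ms, 95)}
```

Nearest-rank always returns a value that was actually observed, which keeps the detail view easy to cross-check against raw poll rows.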


5. Time-since-last-incident per service ("12 days clean")

What's missing: No per-service "clean streak" indicator. A service that has been healthy for 90 days communicates differently from one that had three incidents last week — but both show the same green dot.

Impact: Trust calibration is manual and memory-based.

Fix direction: Compute "last incident at" from console_poll_log where status_code != 200 or ok = false. Display as "clean 12d" or "last incident 2h ago."

Data exists: console_poll_log has checked_at and ok per poll.
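
A sketch of the streak label, assuming polls arrive as (checked_at, ok) pairs. The one-day cutoff between the "clean Nd" and "last incident ... ago" wording is an assumption, as are the function names.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple


def last_incident_at(polls: Iterable[Tuple[datetime, bool]]) -> Optional[datetime]:
    """Return the newest checked_at among failed polls, or None if every poll was ok."""
    failures = [checked_at for checked_at, ok in polls if not ok]
    return max(failures) if failures else None


def streak_label(last_incident: Optional[datetime], now: datetime) -> str:
    """Format the per-service streak badge."""
    if last_incident is None:
        return "no incidents on record"
    age = now - last_incident
    if age >= timedelta(days=1):
        return f"clean {age.days}d"
    hours = age.seconds // 3600
    if hours >= 1:
        return f"last incident {hours}h ago"
    return f"last incident {age.seconds // 60}m ago"
```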


6. "Last alerted at" per service and alert destination

What's missing: No indication of whether an alert was sent for the current or most-recent incident, or where it went (ops@, Slack, PagerDuty).

Impact: Operator cannot tell from the dashboard whether ops@ has already been notified, or whether they need to act manually.

Fix direction: Add "alerted at HH:MM UTC via ops@" to the incident card or detail panel.

Data: Needs a new alert_log table or a field on console_poll_log. This is the one item in this list that requires a small schema addition.
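
One possible shape for the schema addition, shown with sqlite3 purely for illustration. Every column name here is hypothetical; only the table name alert_log comes from this document.

```python
import sqlite3

# Hypothetical schema: one row per alert actually sent.
ALERT_LOG_DDL = """
CREATE TABLE IF NOT EXISTS alert_log (
    id          INTEGER PRIMARY KEY,
    surface_id  TEXT NOT NULL,
    alerted_at  TEXT NOT NULL,   -- ISO 8601 in UTC, e.g. 2026-06-10T14:19:00Z
    destination TEXT NOT NULL    -- e.g. 'ops@', 'Slack', 'PagerDuty'
)
"""


def last_alert_label(conn: sqlite3.Connection, surface_id: str) -> str:
    """Format the most recent alert for a surface, or say that none was sent."""
    row = conn.execute(
        "SELECT alerted_at, destination FROM alert_log "
        "WHERE surface_id = ? ORDER BY alerted_at DESC LIMIT 1",
        (surface_id,),
    ).fetchone()
    if row is None:
        return "no alert sent"
    alerted_at, destination = row
    # Characters 11..15 of an ISO 8601 timestamp are HH:MM.
    return f"alerted at {alerted_at[11:16]} UTC via {destination}"
```

A separate table keeps alert history independent of poll retention; a field on console_poll_log would be simpler but ties the two lifecycles together.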


7. Mobile-responsive layout

What's missing: The current dashboard table layout breaks on viewports under ~900px. The operator confirmed /console/beta was clipped on iPhone earlier today; the status panel has the same grid-based structure and is likely broken identically.

Impact: Operator cannot monitor status from a phone during an incident when they are away from a desktop.

Fix direction: Responsive breakpoints at 640px — collapse the right detail panel into a full-width modal/sheet, reduce the left-rail service list to name + dot only, expand on tap. All three mockup variants in this PR demonstrate viable mobile layouts.

Data: No data change required.


8. Dependency / blast-radius visualization

What's missing: No indication of which services are upstream or downstream of others. When api-prod degrades, does velvet-prod follow? When raxx-mockups goes down, does anything else care?

Impact: During an incident, operator must hold the dependency map in their head.

Fix direction: Static dependency edges (source of truth: the architecture, not polling data) encoded in SURFACE_REGISTRY_BY_ENV or a separate adjacency YAML. Render as either the graph variant (B) or as an "affected by" text list in the detail panel.

Data: Topology is static knowledge; no polling data needed. Add an upstream_of: [list] field to SurfaceConfig.
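
A sketch of how the "affected by" list could be derived from the static edges, assuming each surface's upstream_of list has been collected into a dict keyed by surface name. The function name blast_radius is hypothetical, and no real edges are implied.

```python
from collections import deque


def blast_radius(upstream_of: dict[str, list[str]], degraded: str) -> list[str]:
    """Return every surface downstream of `degraded`, in breadth-first order."""
    seen = {degraded}
    queue = deque([degraded])
    affected = []
    while queue:
        current = queue.popleft()
        for downstream in upstream_of.get(current, []):
            if downstream not in seen:  # guards against cycles and diamond paths
                seen.add(downstream)
                affected.append(downstream)
                queue.append(downstream)
    return affected
```

Breadth-first order lists direct dependents before transitive ones, which matches how an operator would triage.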


9. Keyboard navigation (J/K between services, Enter to drill)

What's missing: All interaction requires mouse/touch. No keyboard shortcuts.

Impact: Power operator workflow is slower than it needs to be; accessibility gap.

Fix direction: Implement J/K to move focus between service rows, Enter to open the detail panel for the focused row, Escape to close. This follows the j/k convention of vi and less, which GitHub's web UI also uses for list navigation.

Data: Frontend only; no backend changes.


10. Compact / icon-only view toggle

What's missing: Each service row occupies full height even when healthy. For at-a-glance monitoring when all services are green, most of that space is noise.

Fix direction: A "compact" toggle collapses each row to a colored dot + service name only (24px row height). Degraded rows auto-expand.

Data: Frontend only.


11. Surface taxonomy grouping

What's missing: All 12 services appear in a flat list with equal visual weight. There is no grouping by provider or criticality.

Impact: It is harder to spot that postmark (SaaS, no deploy control) is behaving differently from api-prod (Heroku, operator-controlled).

Fix direction: Group rows by taxonomy: Heroku apps / Cloudflare Pages / External SaaS / Internal. The provider field on SurfaceConfig already carries this.

Data: SurfaceConfig.provider already set for all surfaces.


12. Stale-probe warning (41s age prominently surfaced)

What's missing: When a probe is stale (last check was 41s ago against a 5-min interval), the dashboard shows the same dot color as a fresh probe. The stale check for raxx-mockups means the displayed HTTP 503 may be 41 seconds old and the service could have recovered or worsened.

Fix direction: Show a yellow "stale" badge on the service row when now() - checked_at > liveness_interval_s * 1.5. Suppress the current dot color and show "stale" until a fresh probe lands.

Data: SiteStatus.checked_at is in the cache. Arithmetic only.
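
The staleness rule above as a sketch. It assumes checked_at and now are comparable datetimes and that liveness_interval_s is available per surface; the constant and function names are illustrative.

```python
from datetime import datetime, timedelta

STALE_FACTOR = 1.5  # multiplier from the rule above


def is_stale(checked_at: datetime, now: datetime, liveness_interval_s: float) -> bool:
    """True when the last probe is older than 1.5x the expected interval."""
    return (now - checked_at) > timedelta(seconds=liveness_interval_s * STALE_FACTOR)
```

The comparison is strict, so a probe that is exactly 1.5 intervals old still counts as fresh.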


13. Site flags and secrets sections (currently empty placeholders)

What's missing: The current detail panel has "Site Flags" and "Secrets" sections that show "—" for most services. These were presumably intended to show feature-flag state or secret-rotation status but were never wired.

Fix direction: Either wire them (feature-developer task) or remove the empty sections. Empty placeholder sections erode trust in the dashboard — they suggest data is missing when none was ever promised.

Data: For feature flags, query feature_flags.yaml for the surface's app. Secret-rotation status requires Velvet integration (out of scope for this design).


14. Recent audit log in context

What's missing: The "Recent Audit" section in the detail panel has no entries for most services. When it does show entries, there is no visual connection between the audit event and the availability chart (e.g., a "flag promoted" audit event that coincided with an incident).

Fix direction: Surface the last 3 audit entries for the selected service in the drill panel, with timestamps aligned to the availability bar so the operator can see whether a config change preceded a degradation.

Data: console_poll_log + audit log tables already exist.