Status Dashboard Redesign — 2026-06-10
Design artifacts for the Console /console/dashboard Infrastructure Status
panel redesign. Three distinct layout variants, each solving a different primary
operator need. Filed against the operator's "harder to monitor" / "needs a
redesign" feedback with the existing screenshot.
Mockup index
| File | Viewport | Screenshot |
|---|---|---|
| variant-a-triage-first.html | 1440x900 | screenshots/variant-a-desktop.png |
| variant-a-triage-first.html | 390x844 | screenshots/variant-a-mobile.png |
| variant-b-map-drill.html | 1440x900 | screenshots/variant-b-desktop.png |
| variant-b-map-drill.html | 390x844 | screenshots/variant-b-mobile.png |
| variant-c-timeseries.html | 1440x900 | screenshots/variant-c-desktop.png |
| variant-c-timeseries.html | 390x844 | screenshots/variant-c-mobile.png |
Variant A — Triage-first
Primary question answered: "What do I need to look at right now?"
Design intent: Degraded services float to the top as full-width incident cards with explicit fields: HTTP status, latency, 48h availability, when the degradation started, and whether the degradation correlates with the last deploy. Healthy services collapse into a single expandable pill: "10 of 10 nominal." The green-state hero makes an affirmative calm signal possible for the first time: if nothing is degraded, the page leads with "12 of 12 nominal" in moss.
What the current design doesn't do that this fixes:
- Equal visual weight for all services (degraded and healthy look identical)
- No hero signal when everything is fine
- No deploy correlation in the incident view
- No stale-probe warning when the last check is old
Trade-offs:
- Requires ranking logic: what counts as "degraded" vs "elevated" vs "nominal."
Current code uses ok: bool — this design adds a soft "elevated" tier for
latency spikes without HTTP error codes. Feature-developer will need to
implement a severity: ok | elevated | degraded field on SiteStatus.
- Incident cards are verbose; on mobile the action row wraps to a second line.
Acceptable: action buttons are the right CTA even on mobile.
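The ranking logic mentioned above could look roughly like the sketch below. It assumes a three-value severity enum and a stripped-down SiteStatus with only a name and a severity; the real model's fields are not shown in these notes, so both the dataclass and the helper names (triage_order, split_for_triage) are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Three-tier health state; higher values sort first in the triage view."""
    OK = 0
    ELEVATED = 1
    DEGRADED = 2


@dataclass
class SiteStatus:
    # Hypothetical subset of fields; the real SiteStatus model is not shown here.
    name: str
    severity: Severity


def triage_order(statuses: list[SiteStatus]) -> list[SiteStatus]:
    """Degraded first, then elevated, then nominal; ties broken by name."""
    return sorted(statuses, key=lambda s: (-s.severity, s.name))


def split_for_triage(statuses: list[SiteStatus]):
    """Return (incident cards, collapsed nominal group)."""
    ordered = triage_order(statuses)
    incidents = [s for s in ordered if s.severity is not Severity.OK]
    nominal = [s for s in ordered if s.severity is Severity.OK]
    return incidents, nominal
```

Using an integer-backed enum keeps the sort key trivial and leaves room for a later tier without touching the ordering code.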
Mobile behavior: Nav and clock hidden below 640px. Incident action buttons stack horizontally below the incident body. Nominal list columns collapse to name + availability only.
Variant B — Map + Drill
Primary question answered: "When X degrades, what else is affected?"
Design intent: A static SVG dependency graph shows the relationships between services: clients connect to api-prod; api-prod connects to console-prod, velvet-prod, queue-prod, and postmark; console-prod connects to raxx-mockups and raxx-app. Edge color reflects health propagation: a red dashed edge from console-prod to raxx-mockups communicates degradation without requiring the operator to read a list. Clicking a node opens the drill panel on the right showing full detail for that service.
The degraded raxx-mockups node is shown as selected in this mockup (halo ring, highlighted border). The drill panel shows liveness, 48h availability bar, last deploy, and a "downstream impact" section confirming no cascading blast radius.
What the current design doesn't do that this fixes:
- No dependency information at all; blast radius is invisible
- No spatial layout that communicates topology
- The right panel (current design) shows data for a pre-selected service only; there is no way to navigate the service graph
Trade-offs:
- The dependency graph is static topology, not data-driven. It requires encoding
upstream_of relationships in SurfaceConfig. This is a small YAML change
but it's a new concept.
- On mobile (< 900px), the graph stacks above the drill panel. A 320px graph
height still communicates the topology but is harder to tap. Node tap targets
should be at least 44px — the SVG rects in this mockup are 42px tall. Feature-
developer should add 6px padding to touch targets.
- The SVG is manually laid out. A force-directed layout library would be more
scalable as the surface registry grows, but introduces a dependency.
For 12 services, static SVG is sufficient.
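As a rough illustration of how the drill panel's "downstream impact" section could be computed from a static topology, the sketch below walks the edges as drawn in this mockup. The EDGES dict shape and the downstream_of helper are hypothetical; how upstream_of ends up encoded in SurfaceConfig is still an open question, and the edge direction here simply follows the drawing.

```python
from collections import deque

# Edges as drawn in the Variant B mockup, keyed by the node each edge starts from.
# The dict shape is a hypothetical stand-in for whatever SurfaceConfig ends up storing.
EDGES: dict[str, list[str]] = {
    "clients": ["api-prod"],
    "api-prod": ["console-prod", "velvet-prod", "queue-prod", "postmark"],
    "console-prod": ["raxx-mockups", "raxx-app"],
}


def downstream_of(service: str, edges: dict[str, list[str]] = EDGES) -> set[str]:
    """Breadth-first walk collecting every node reachable from `service`."""
    seen: set[str] = set()
    queue = deque(edges.get(service, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(edges.get(node, []))
    return seen
```

For the selected raxx-mockups node this returns an empty set, which matches the "no cascading blast radius" state shown in the drill panel.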
Mobile behavior: Below 900px, graph and drill panel stack vertically. Below 480px, the clock is hidden and page padding tightens.
Variant C — Timeseries
Primary question answered: "Are there patterns I'm not seeing from a single snapshot?"
Design intent: Each service is a horizontal strip. The dominant visual element for each row is a 48-segment availability bar (each segment = ~30 minutes of history), overlaid with bronze deploy markers at the precise point where a deploy happened. The latency sparkline in the adjacent column shows 24h p50 (solid moss line) and p95 (lighter red line). Together, these let the operator spot:
- Latency creep over time (postmark row shows both lines trending upward)
- Deploy-correlated incidents (raxx-mockups shows a deploy marker near the start of the current outage cluster)
- Recurring outage patterns (a service with 3 down-segments spread across the bar has a different character than one with a single long outage)
Services are grouped by provider taxonomy: Heroku apps / Cloudflare Pages / External SaaS / Internal.
A summary stat bar at the top gives the at-a-glance count (nominal, degraded, elevated, fleet uptime, deploy count).
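One way the 48-segment availability bar could be derived from poll history is sketched below. It assumes each poll is available as a (checked_at, ok) pair and that a segment counts as down if any poll inside it failed; both the input shape and that rule are assumptions, not something these notes specify.

```python
from datetime import datetime, timedelta

SEGMENTS = 48
SEGMENT_LENGTH = timedelta(minutes=30)


def availability_segments(polls: list[tuple[datetime, bool]], now: datetime) -> list[str]:
    """Bucket (checked_at, ok) pairs into 48 half-hour segments, oldest first.

    Each segment is "up", "down", or "unknown" (no polls landed in it).
    A segment is "down" if any poll inside it failed.
    """
    window_start = now - SEGMENTS * SEGMENT_LENGTH
    segments = ["unknown"] * SEGMENTS
    for checked_at, ok in polls:
        if not (window_start <= checked_at < now):
            continue  # outside the rendered window
        index = int((checked_at - window_start) / SEGMENT_LENGTH)
        if not ok:
            segments[index] = "down"
        elif segments[index] == "unknown":
            segments[index] = "up"
    return segments
```

Keeping an explicit "unknown" state lets the bar distinguish missing probes from healthy ones instead of painting gaps green.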
What the current design doesn't do that this fixes:
- Availability bars exist but carry no deploy markers
- No latency history; only the current snapshot value
- No provider grouping
- No fleet-level summary stat
Trade-offs:
- This layout is dense. On a 1440px-wide screen, the availability strip and
sparkline are each given proportional space. Below 768px, the sparkline and
latency columns are hidden, leaving only name + availability bar + action.
This is intentional: the availability bar conveys the most value on mobile.
- The sparkline is SVG rendered inline with hardcoded polyline points in this
mockup. Feature-developer needs to generate SVG polyline data server-side or
compute it client-side from the poll_log API.
- The deploy marker data-label attribute currently shows in CSS ::after
pseudo-content. On mobile this label overlaps adjacent markers if two deploys
are close together. Feature-developer should add a tooltip on tap instead.
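For the server-side option, generating the sparkline's polyline data could be as small as the sketch below. The function name and the width/height parameters are hypothetical, and it assumes the latency series has already been fetched and ordered oldest to newest.

```python
def polyline_points(values: list[float], width: float, height: float) -> str:
    """Scale a latency series into the `points` attribute of an SVG <polyline>.

    x runs left to right across `width`; y is flipped so larger values sit
    higher, because SVG's y axis grows downward.
    """
    if not values:
        return ""
    low, high = min(values), max(values)
    span = high - low
    step = width / (len(values) - 1) if len(values) > 1 else 0.0
    points = []
    for i, value in enumerate(values):
        x = i * step
        # A flat series has no vertical range, so draw it through the middle.
        y = height / 2 if span == 0 else height - (value - low) / span * height
        points.append(f"{x:.1f},{y:.1f}")
    return " ".join(points)
```

Note that scaling each series to its own min and max makes rows incomparable at a glance; if p50 and p95 share one sparkline, they would need a shared scale.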
Mobile behavior: Below 768px, sparkline and latency columns hidden; strips show name + availability bar + action only. Below 480px, page padding tightens and clock is hidden.
Design decisions made without human input
- "Elevated latency" as a third health tier (amber / warning), distinct from HTTP-level degradation. Implemented for postmark (HTTP 200 but 1,240ms vs 380ms p50 baseline). Feature-developer needs to implement the threshold logic. Suggested rule: elevated when current latency > 2x the rolling 24h p50.
- Deploy marker color is antler-bronze (#B08D57), not moss or danger. This avoids conflating deploy events with health status: a deploy is a change event, not a failure event.
- Provider grouping taxonomy matches SurfaceConfig.provider values: heroku | cloudflare_pages | lightsail + freescout grouped as "External SaaS" | console_self + cron_heartbeat grouped as "Internal."
- Stale probe threshold: show yellow "stale" state when now() - checked_at > liveness_interval_s * 1.5. At the default 300s interval this triggers at 450s (7.5 minutes).
- On Variant B, raxx-mockups is shown pre-selected in the drill panel because it is the most operationally relevant item. Feature-developer should default-select the first degraded service on page load, falling back to the first elevated service, falling back to the first service in the registry.
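The stale-probe rule and the default-selection fallback above can be sketched as follows. The function names and the (name, severity) pair shape are hypothetical; only the 1.5x multiplier, the 300s default interval, and the fallback order come from these notes.

```python
from datetime import datetime, timedelta
from typing import Optional

STALE_FACTOR = 1.5  # multiplier from the design decision above


def is_stale(checked_at: datetime, now: datetime, liveness_interval_s: int = 300) -> bool:
    """True when the last probe is older than 1.5x the liveness interval."""
    return now - checked_at > timedelta(seconds=liveness_interval_s * STALE_FACTOR)


def default_selection(services: list[tuple[str, str]]) -> Optional[str]:
    """Pick the drill-panel default from (name, severity) pairs in registry order.

    Falls back from first degraded, to first elevated, to first in the registry.
    """
    for wanted in ("degraded", "elevated"):
        for name, severity in services:
            if severity == wanted:
                return name
    return services[0][0] if services else None
```

The comparison is strict, so a probe exactly 450s old at the default interval is not yet flagged as stale.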
Open questions (must resolve before feature-developer picks up)
- Which variant to implement? Variants A and C can reasonably be combined (triage hero + timeseries strips). Variant B is a distinct implementation path. Operator decision needed.
- Severity thresholds. What p50 multiplier defines "elevated latency"? 2x? 1.5x? Is a minimum absolute floor needed (e.g., never elevate if latency < 200ms regardless of multiplier)?
- Dependency topology for Variant B. Do we want to encode upstream_of: [list] in SurfaceConfig, or maintain a separate adjacency YAML? The latter is more readable but adds a file to maintain.
- Alert log integration (WHATS-MISSING.md item 6). "Last alerted at" data requires either a new table or adding an alerted_at field to console_poll_log. Is this in scope for the same feature card?
- p50/p95 query performance. console_poll_log is append-only with one row per poll per surface (~12 surfaces × 288 polls/day = ~3,456 rows/day). At this volume a rolling percentile query is cheap. But if the poll interval is shortened post-launch, an index on (surface_id, checked_at) will be needed. Architect should confirm index exists before feature-developer wires the sparkline endpoint.
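To make the threshold question concrete, here is a minimal sketch of the percentile and elevation checks, computed in application code from one surface's latency samples. The function names are hypothetical, the multiplier and floor are deliberately left as parameters because both are unresolved, and an in-database percentile query may turn out to be the better fit.

```python
from statistics import quantiles


def latency_percentiles(latencies_ms: list[float]) -> tuple[float, float]:
    """Return (p50, p95) for one surface's latency samples over the window."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two samples to interpolate percentiles")
    cuts = quantiles(latencies_ms, n=100, method="inclusive")  # 99 cut points
    return cuts[49], cuts[94]


def is_elevated(current_ms: float, p50_ms: float,
                multiplier: float = 2.0, floor_ms: float = 0.0) -> bool:
    """Elevated when latency exceeds multiplier x p50 and clears the floor.

    The 2.0 default is the suggested value; both knobs are open questions.
    """
    return current_ms >= floor_ms and current_ms > multiplier * p50_ms
```

With the postmark figures from the mockup (1,240ms against a 380ms p50), is_elevated returns True at 2x, and it would also do so at 1.5x.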
Implementation cards cross-link
PM is filing implementation cards in parallel. TODO: cross-link issue numbers here once filed. Suggest one card per variant + one shared card for the data layer (p50/p95 endpoint, stale-probe logic, severity tier).