Raxx · internal docs

Status Dashboard Redesign — 2026-06-10

Design artifacts for the redesign of the Infrastructure Status panel at /console/dashboard in the Console. There are three distinct layout variants, each solving a different primary operator need. Filed in response to the operator's "harder to monitor" / "needs a redesign" feedback on the existing screenshot.


Mockup index

File                          Viewport   Screenshot
variant-a-triage-first.html   1440x900   screenshots/variant-a-desktop.png
variant-a-triage-first.html   390x844    screenshots/variant-a-mobile.png
variant-b-map-drill.html      1440x900   screenshots/variant-b-desktop.png
variant-b-map-drill.html      390x844    screenshots/variant-b-mobile.png
variant-c-timeseries.html     1440x900   screenshots/variant-c-desktop.png
variant-c-timeseries.html     390x844    screenshots/variant-c-mobile.png

Variant A — Triage-first

Primary question answered: "What do I need to look at right now?"

Design intent: Degraded services float to the top as full-width incident cards with explicit fields: HTTP status, latency, 48h availability, when the degradation started, and whether it correlates with the last deploy. Healthy services collapse into a single expandable pill: "10 of 12 nominal." The green-state hero makes an affirmative calm signal possible for the first time: if nothing is degraded, the page leads with "12 of 12 nominal" in moss.

What the current design doesn't do that this fixes:
- No visual hierarchy: degraded and healthy services carry equal weight and look identical
- No hero signal when everything is fine
- No deploy correlation in the incident view
- No stale-probe warning when the last check is old

Trade-offs:
- Requires ranking logic: what counts as "degraded" vs "elevated" vs "nominal." Current code uses ok: bool. This design adds a soft "elevated" tier for latency spikes that occur without HTTP error codes. Feature-developer will need to implement a severity: ok | elevated | degraded field on SiteStatus.
- Incident cards are verbose; on mobile the action row wraps to a second line. This is acceptable: action buttons are the right CTA even on mobile.
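The severity tier above can be sketched in Python as follows. This is a minimal illustration, not the real model: only ok and severity are named in this document, so the SiteStatus fields latency_ms and p50_baseline_ms, the classify helper, and the floor_ms parameter are all hypothetical. The 2x multiplier is the suggestion from the design decisions below, and both the multiplier and the floor are still open questions.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(str, Enum):
    OK = "ok"
    ELEVATED = "elevated"
    DEGRADED = "degraded"


@dataclass
class SiteStatus:
    # Hypothetical shape: only `ok` is known to exist on the real SiteStatus.
    name: str
    ok: bool
    latency_ms: float
    p50_baseline_ms: float


def classify(status: SiteStatus, multiplier: float = 2.0, floor_ms: float = 0.0) -> Severity:
    """Map one probe result onto the three tiers.

    multiplier and floor_ms are assumptions pending the open question on thresholds.
    """
    if not status.ok:
        # HTTP-level failure always wins over latency.
        return Severity.DEGRADED
    above_floor = status.latency_ms >= floor_ms
    if above_floor and status.latency_ms > multiplier * status.p50_baseline_ms:
        return Severity.ELEVATED
    return Severity.OK
```

With the mockup's postmark numbers (HTTP 200, 1,240ms against a 380ms baseline), classify returns Severity.ELEVATED, because 1,240 exceeds 2 x 380 = 760.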

Mobile behavior: Nav and clock are hidden below 640px. Incident action buttons sit in a horizontal row below the incident body. Nominal list columns collapse to name + availability only.


Variant B — Map + Drill

Primary question answered: "When X degrades, what else is affected?"

Design intent: A static SVG dependency graph shows the relationships between services: clients connect to api-prod; api-prod fans out to console-prod, velvet-prod, queue-prod, and postmark; console-prod fans out to raxx-mockups and raxx-app. Edge color reflects health propagation: a red dashed edge from console-prod to raxx-mockups communicates degradation without requiring the operator to read a list. Clicking a node opens the drill panel on the right, which shows full detail for that service.

The degraded raxx-mockups node is shown as selected in this mockup (halo ring, highlighted border). The drill panel shows liveness, 48h availability bar, last deploy, and a "downstream impact" section confirming no cascading blast radius.

What the current design doesn't do that this fixes:
- No dependency information at all, so blast radius is invisible
- No spatial layout that communicates topology
- The right panel (current design) shows data for a pre-selected service only; there is no way to navigate the service graph

Trade-offs:
- The dependency graph is static topology, not data-driven. It requires encoding upstream_of relationships in SurfaceConfig. This is a small YAML change, but it introduces a new concept.
- On mobile (< 900px), the graph stacks above the drill panel. A 320px graph height still communicates the topology but is harder to tap. Node tap targets should be at least 44px; the SVG rects in this mockup are 42px tall. Feature-developer should add 6px padding to touch targets.
- The SVG is manually laid out. A force-directed layout library would scale better as the surface registry grows, but it introduces a dependency. For 12 services, static SVG is sufficient.
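The "downstream impact" section of the drill panel could be derived from upstream_of with a breadth-first walk, sketched below in Python. The edges are the ones drawn in this mockup. The direction is an assumption: here upstream_of maps a service to the services directly downstream of it, and impact flows along those edges. The dict layout and the downstream_impact helper are hypothetical, not the SurfaceConfig schema.

```python
from collections import deque
from typing import Dict, List

# Edges as drawn in the Variant B mockup (assumed direction: key is upstream of values).
UPSTREAM_OF: Dict[str, List[str]] = {
    "clients": ["api-prod"],
    "api-prod": ["console-prod", "velvet-prod", "queue-prod", "postmark"],
    "console-prod": ["raxx-mockups", "raxx-app"],
}


def downstream_impact(service: str, upstream_of: Dict[str, List[str]]) -> List[str]:
    """Return every service reachable downstream of `service`, nearest first."""
    seen = {service}  # guards against cycles that lead back to the start node
    impacted: List[str] = []
    queue = deque(upstream_of.get(service, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        impacted.append(node)
        queue.extend(upstream_of.get(node, []))
    return impacted
```

Under this reading, downstream_impact("raxx-mockups", UPSTREAM_OF) is empty, which matches the "no cascading blast radius" shown in the drill panel.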

Mobile behavior: Below 900px, graph and drill panel stack vertically. Below 480px, the clock is hidden and page padding tightens.


Variant C — Timeseries

Primary question answered: "Are there patterns I'm not seeing from a single snapshot?"

Design intent: Each service is a horizontal strip. The dominant visual element in each row is a 48-segment availability bar (each segment = ~30 minutes of history), overlaid with bronze deploy markers at the precise point where each deploy happened. The latency sparkline in the adjacent column shows 24h p50 (solid moss line) and p95 (lighter red line). Together, these let the operator spot:
- Latency creep over time (the postmark row shows both lines trending upward)
- Deploy-correlated incidents (raxx-mockups shows a deploy marker near the start of the current outage cluster)
- Recurring outage patterns (a service with 3 down-segments spread across the bar has a different character than one with a single long outage)
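One way to build the 48-segment bar and place deploy markers on it is sketched below in Python. It assumes each poll is available as a (checked_at, ok) pair and treats a segment as down if any poll in it failed; that rule, the None value for segments with no data, and both function names are assumptions for illustration only.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Optional, Tuple

SEGMENTS = 48
SEGMENT_LENGTH = timedelta(minutes=30)  # ~30 minutes per segment, per the design


def availability_segments(
    polls: Iterable[Tuple[datetime, bool]], now: datetime
) -> List[Optional[bool]]:
    """Return 48 entries, oldest first.

    True = every poll ok, False = at least one failed poll, None = no polls.
    """
    window_start = now - SEGMENTS * SEGMENT_LENGTH
    segments: List[Optional[bool]] = [None] * SEGMENTS
    for checked_at, ok in polls:
        if not (window_start <= checked_at < now):
            continue  # outside the bar's time window
        index = int((checked_at - window_start) / SEGMENT_LENGTH)
        previous = segments[index]
        segments[index] = ok if previous is None else (previous and ok)
    return segments


def deploy_marker_offset(deployed_at: datetime, now: datetime) -> Optional[float]:
    """Horizontal position of a deploy marker as a 0..1 fraction of the bar width."""
    window = SEGMENTS * SEGMENT_LENGTH
    age = now - deployed_at
    if age < timedelta(0) or age > window:
        return None  # deploy falls outside the bar
    return 1.0 - age / window
```

The fraction returned by deploy_marker_offset can be used directly as a CSS left percentage, so a deploy halfway through the window lands at 50% of the bar.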

Services are grouped by provider taxonomy: Heroku apps / Cloudflare Pages / External SaaS / Internal.

A summary stat bar at the top gives the at-a-glance count (nominal, degraded, elevated, fleet uptime, deploy count).

What the current design doesn't do that this fixes:
- Availability bars exist but carry no deploy markers
- No latency history, only the current snapshot value
- No provider grouping
- No fleet-level summary stat

Trade-offs:
- This layout is dense. On a 1440px-wide screen, the availability strip and sparkline are each given proportional space. Below 768px, the sparkline and latency columns are hidden, leaving only name + availability bar + action. This is intentional: the availability bar conveys the most value on mobile.
- The sparkline is inline SVG with hardcoded polyline points in this mockup. Feature-developer needs to generate the SVG polyline data server-side or compute it client-side from the poll_log API.
- The deploy marker's data-label attribute is currently rendered through CSS ::after pseudo-content. On mobile this label overlaps adjacent markers if two deploys are close together. Feature-developer should add a tooltip on tap instead.
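Generating the sparkline data server-side could look roughly like the Python sketch below, which turns a series of latency values into the string expected by the points attribute of an SVG polyline. The function name, the default 120x24 viewBox, and the shared y_max parameter are assumptions, not values taken from the mockup.

```python
from typing import Optional, Sequence


def polyline_points(
    values: Sequence[float],
    width: float = 120.0,
    height: float = 24.0,
    y_max: Optional[float] = None,
) -> str:
    """Scale latency samples into "x,y x,y ..." for an SVG <polyline>.

    Pass the same y_max for the p50 and p95 series so both lines share one scale.
    """
    if not values:
        return ""
    top = y_max if y_max is not None else max(values)
    top = top or 1.0  # avoid dividing by zero when every sample is 0
    step = width / (len(values) - 1) if len(values) > 1 else 0.0
    points = []
    for i, value in enumerate(values):
        x = i * step
        # SVG y grows downward, so larger latencies get smaller y values.
        y = height - (min(value, top) / top) * height
        points.append(f"{x:.1f},{y:.1f}")
    return " ".join(points)
```

For example, polyline_points([0, 50, 100], width=100, height=20) yields "0.0,20.0 50.0,10.0 100.0,0.0".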

Mobile behavior: Below 768px, sparkline and latency columns hidden; strips show name + availability bar + action only. Below 480px, page padding tightens and clock is hidden.


Design decisions made without human input

  1. "Elevated latency" as a third health tier (amber / warning), distinct from HTTP-level degradation. Implemented for postmark (HTTP 200 but 1,240ms vs 380ms p50 baseline). Feature-developer needs to implement the threshold logic — suggest: elevated when current latency > 2x the rolling 24h p50.

  2. Deploy marker color is antler-bronze (#B08D57), not moss or danger. This avoids conflating deploy events with health status — a deploy is a change event, not a failure event.

  3. Provider grouping taxonomy follows SurfaceConfig.provider values: heroku and cloudflare_pages each form their own group; lightsail and freescout are grouped as "External SaaS"; console_self and cron_heartbeat are grouped as "Internal."

  4. Stale probe threshold: show yellow "stale" state when now() - checked_at > liveness_interval_s * 1.5. At the default 300s interval this triggers at 450s (7.5 minutes).

  5. On Variant B, raxx-mockups is shown pre-selected in the drill panel because it is the most operationally relevant item. Feature-developer should default-select the first degraded service on page load, falling back to the first elevated service, falling back to the first service in the registry.
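Decisions 4 and 5 above can be sketched in Python as follows. The checked_at and liveness_interval_s names come from this document; the function names and the (name, severity) tuple shape are hypothetical stand-ins for whatever the registry actually exposes.

```python
from datetime import datetime
from typing import Optional, Sequence, Tuple

STALE_FACTOR = 1.5  # from decision 4


def is_stale(checked_at: datetime, now: datetime, liveness_interval_s: int = 300) -> bool:
    """True when the last probe is older than 1.5x the liveness interval."""
    return (now - checked_at).total_seconds() > liveness_interval_s * STALE_FACTOR


def default_selection(services: Sequence[Tuple[str, str]]) -> Optional[str]:
    """Pick the drill-panel default from (name, severity) pairs in registry order.

    Order of preference: first degraded, then first elevated, then first service.
    """
    for wanted in ("degraded", "elevated"):
        for name, severity in services:
            if severity == wanted:
                return name
    return services[0][0] if services else None
```

At the default 300s interval, a probe that is 451 seconds old is stale and one that is exactly 450 seconds old is not, because the comparison is strict.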


Open questions (must resolve before feature-developer picks up)

  1. Which variant to implement? Variants A and C can reasonably be combined (triage hero + timeseries strips). Variant B is a distinct implementation path. Operator decision needed.

  2. Severity thresholds. What p50 multiplier defines "elevated latency"? 2x? 1.5x? Is a minimum absolute floor needed (e.g., never elevate if latency < 200ms regardless of multiplier)?

  3. Dependency topology for Variant B. Do we want to encode upstream_of: [list] in SurfaceConfig, or maintain a separate adjacency YAML? The latter is more readable but adds a file to maintain.

  4. Alert log integration (WHATS-MISSING.md item 6). "Last alerted at" data requires either a new table or adding an alerted_at field to console_poll_log. Is this in scope for the same feature card?

  5. p50/p95 query performance. console_poll_log is append-only with one row per poll per surface (~12 surfaces × 288 polls/day = ~3,456 rows/day). At this volume a rolling percentile query is cheap. But if the poll interval is shortened post-launch, an index on (surface_id, checked_at) will be needed. Architect should confirm index exists before feature-developer wires the sparkline endpoint.
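For questions 2 and 5, the row-volume arithmetic and one possible percentile definition are sketched below in Python. The nearest-rank method is an assumption chosen for simplicity; the document does not say how p50/p95 should be computed, and a database-side percentile function may interpolate differently. Both function names are hypothetical.

```python
import math
from typing import Sequence

SECONDS_PER_DAY = 86_400


def rows_per_day(surfaces: int, poll_interval_s: int) -> int:
    """Rows appended to console_poll_log per day: one row per poll per surface."""
    return surfaces * (SECONDS_PER_DAY // poll_interval_s)


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of latency samples, with pct in (0, 100]."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Multiply before dividing so whole-number ranks stay exact in floating point.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]
```

rows_per_day(12, 300) reproduces the ~3,456 rows/day figure above, while a 60s interval would give 17,280 rows/day, which illustrates how quickly the volume grows if the poll interval is shortened.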


PM is filing implementation cards in parallel. TODO: cross-link issue numbers here once filed. Suggest one card per variant + one shared card for the data layer (p50/p95 endpoint, stale-probe logic, severity tier).