Raxx · internal docs


RCA — FreeScout cloud-init FATAL: could not extract MySQL root password

Incident ID: 2026-04-29-freescout-cloud-init-fatal
Date: 2026-04-29
Severity: SEV-2
Duration: ~2.5 hours elapsed (from first apply attempt to root cause identified and template patched; install not yet completed on the existing instance, manual recovery pending operator execution)
Blast radius: FreeScout ticketing system (tickets.raxx.app) was never successfully bootstrapped; the site served the Bitnami default page. No customer data affected (pre-launch system). Operator was unable to use the ticketing system.
Author: sre-agent

Summary

Six consecutive Terraform apply attempts failed to bootstrap FreeScout on a Lightsail LAMP instance. The bootstrap script (cloud-init user-data) exited with a FATAL error reporting that it could not read the MariaDB root password from /home/bitnami/bitnami_credentials. Two compounding bugs caused the failure. First, the MariaDB readiness wait loop used the wrong service name (mysql instead of mariadb), wasting 150 s. Second, the credentials poll loop used an if [ -f file ] && grep -q ... condition which, under dash with set -e, caused the script to exit silently on the first iteration rather than polling for 5 minutes (60 × 5 s). As a result, the script exited 17 seconds before Bitnami wrote the actual credentials. The fix is a template patch to user_data.sh.tpl (two corrections plus an extended poll window) and a manual recovery run on the existing instance.

Timeline (all times UTC)

Impact

What went well

What didn't go well

Root cause analysis

Detection

Resolution

Template patch (durable fix for future applies):

- terraform/freescout/templates/user_data.sh.tpl:
  - Changed ctlscript.sh status mysql to ctlscript.sh status mariadb; the output is captured to a variable, then the variable is grepped (avoids pipeline set -e issues)
  - Replaced if [ -f file ] && grep -q ... with nested if statements + a CREDS_READY flag variable
  - Extended credentials poll from 60 × 5s to 120 × 5s (10 min)
  - Updated FATAL message to say "10 min"
- terraform/README.md lessons-learned: added three new entries documenting the mariadb service name, the dash set -e + && gotcha, and the Bitnami credential write timing
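
The patched loops described above can be sketched roughly as follows. This is a minimal illustration, not the contents of user_data.sh.tpl. The ctlscript.sh path, the "already running" status wording, the "password" marker in the credentials file, the 30-try readiness limit, and the function and variable names other than CREDS_READY are all assumptions made for the example.

```sh
#!/bin/sh
set -eu

# All defaults below are illustrative assumptions, not values read from the real template.
CTL="${CTL:-/opt/bitnami/ctlscript.sh}"
CREDS_FILE="${CREDS_FILE:-/home/bitnami/bitnami_credentials}"
READY_PATTERN="${READY_PATTERN:-already running}"   # assumed ctlscript.sh wording
CREDS_PATTERN="${CREDS_PATTERN:-password}"          # assumed marker in the credentials file
DB_TRIES="${DB_TRIES:-30}"                          # assumed readiness limit
POLL_TRIES="${POLL_TRIES:-120}"                     # 120 x 5 s = 10 min
POLL_SLEEP="${POLL_SLEEP:-5}"

# Wait until ctlscript.sh reports the mariadb service (not mysql) as running.
wait_for_mariadb() {
    i=0
    while [ "$i" -lt "$DB_TRIES" ]; do
        # Capture the output first, tolerating a non-zero exit, then grep the variable.
        status="$("$CTL" status mariadb 2>&1 || true)"
        if printf '%s\n' "$status" | grep -q "$READY_PATTERN"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$POLL_SLEEP"
    done
    return 1
}

# Poll for the credentials file using nested ifs and a flag instead of
# a single "[ -f file ] && grep -q ..." condition.
wait_for_credentials() {
    CREDS_READY=0
    i=0
    while [ "$i" -lt "$POLL_TRIES" ]; do
        if [ -f "$CREDS_FILE" ]; then
            if grep -q "$CREDS_PATTERN" "$CREDS_FILE"; then
                CREDS_READY=1
                break
            fi
        fi
        i=$((i + 1))
        sleep "$POLL_SLEEP"
    done
    [ "$CREDS_READY" -eq 1 ]
}
```

A caller would invoke each function inside an if (for example, if ! wait_for_credentials; then print the FATAL message and exit 1; fi) so that a timeout is reported explicitly instead of ending the script through set -e.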

Manual recovery on existing instance (required; the template fix only applies to future instances):

Operator must run the manual recovery commands from docs/ops/runbooks/freescout.md (Failure mode A section) on the existing raxx-tickets instance. The instance is healthy (MariaDB running, Apache running, all binaries in place); only the FreeScout application layer is missing.

Validation (after manual recovery):

- curl -sI https://tickets.raxx.app/ returns 200 with FreeScout/Laravel headers
- Cloudflare Access login gate prompts on first visit
- cloud-init status still reports error (cannot be fixed without a new instance; acceptable since the install will have succeeded manually)
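
The first validation check could be scripted along these lines. This is a hedged sketch: the helper names are hypothetical, and the header pattern (laravel or freescout appearing somewhere in the response headers) is an assumption about what the running application sends, so it should be adjusted after inspecting a real response.

```sh
#!/bin/sh
set -eu

SITE_URL="${SITE_URL:-https://tickets.raxx.app/}"
# Assumption: at least one response header mentions one of these words.
APP_HEADER_PATTERN="${APP_HEADER_PATTERN:-laravel|freescout}"

# Print the status code from the first line of `curl -sI` output on stdin.
http_status() {
    tr -d '\r' | awk 'NR == 1 { print $2 }'
}

# Read response headers on stdin; succeed only on 200 plus a matching header.
check_headers() {
    headers="$(cat)"
    status="$(printf '%s\n' "$headers" | http_status)"
    if [ "$status" != "200" ]; then
        echo "FAIL: expected 200, got ${status:-nothing}"
        return 1
    fi
    if printf '%s\n' "$headers" | grep -Eiq "$APP_HEADER_PATTERN"; then
        echo "OK: 200 with application headers"
        return 0
    fi
    echo "FAIL: 200 but no header matching $APP_HEADER_PATTERN"
    return 1
}
```

Usage would be curl -sI "$SITE_URL" | check_headers. Keeping the parsing separate from the network call lets the check be exercised against saved header output as well.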

Action items

| # | Action | Owner | Due | Issue |
|---|--------|-------|-----|-------|
| 1 | Run manual FreeScout recovery on raxx-tickets instance per runbook Failure mode A | operator | 2026-04-29 | |
| 2 | After recovery: take a Lightsail manual snapshot (aws lightsail create-instance-snapshot) as bootstrap checkpoint | operator | 2026-04-29 | |
| 3 | Add post-apply smoke test: SSH to instance after apply, assert bootstrap log ends with "completed at" | sre-agent / operator | 2026-05-06 | TBD |
| 4 | Add Cloudflare health check on tickets.raxx.app that alerts when response body contains "Bitnami" | sre-agent / operator | 2026-05-06 | TBD |
| 5 | Audit all other user-data / cloud-init scripts in the repo for set -e + &&-in-if patterns under dash | sre-agent | 2026-05-06 | TBD |
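
Action items 3 and 5 could start from something like the sketch below. The function names are hypothetical, the bootstrap log path is not given in this document and must be supplied by the caller, and the audit regex is only a first-pass heuristic for single-line if [ ... ] && conditions, so its hits and misses both need manual review.

```sh
#!/bin/sh
set -eu

# Item 3: succeed only if the last line of the given log contains "completed at".
# The log path is passed in by the caller; no real path is assumed here.
log_completed() {
    log="$1"
    if [ ! -f "$log" ]; then
        return 1
    fi
    last="$(tail -n 1 "$log")"
    case "$last" in
        *"completed at"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Item 5: list lines of the form "if [ ... ] && ..." under a directory.
# grep exits 1 when nothing matches, which is treated as a clean result.
audit_if_and() {
    grep -rnE 'if[[:space:]]+\[[^]]*][[:space:]]*&&' "$1" || true
}
```

The smoke test would run log_completed on the instance over SSH after each apply, while audit_if_and can be pointed at the repository's template directories and prints file:line:content for each hit.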

References