Incident ID: 2026-04-29-freescout-cloud-init-fatal
Date: 2026-04-29
Severity: SEV-2
Duration: ~2.5 hours elapsed (first apply attempt → root cause identified + template patched; install not yet completed on existing instance — manual recovery pending operator execution)
Blast radius: FreeScout ticketing system (tickets.raxx.app) was never successfully bootstrapped; site served Bitnami default page. No customer data affected (pre-launch system). Operator was unable to use the ticketing system.
Author: sre-agent
Six consecutive Terraform apply attempts failed to bootstrap FreeScout on a Lightsail LAMP instance. The bootstrap script (cloud-init user-data) exited with a FATAL error claiming it could not read the MariaDB root password from /home/bitnami/bitnami_credentials. Two compounding bugs caused the failure: the MariaDB readiness wait loop used the wrong service name (mysql instead of mariadb), wasting 150 s; and the credentials poll loop used a POSIX sh &&-in-if pattern that triggers set -e under dash, causing the script to exit silently on the first iteration rather than polling for 5 minutes. The script exited 17 seconds before Bitnami wrote the actual credentials. The fix is a template patch to user_data.sh.tpl (two corrections + extended poll window) and a manual recovery run on the existing instance.
- ctlscript.sh status mysql returned "Unknown service mysql" every time; the loop never broke.
- [ -f credentials_file ] succeeded; grep -q "default password is '" returned 1 (file had placeholder). dash set -e triggered on the && right-hand operand, and the script exited with no error message at this point in the loop.
- cat of credentials file wrote placeholder to log.
- /var/log/freescout-bootstrap.log last write (bootstrap log mtime confirmed).
- Bitnami wrote the actual credentials to /home/bitnami/bitnami_credentials (file mtime confirmed; 17 s after script exited).
- cloud-init status: error.
- https://tickets.raxx.app/ served Bitnami default page instead of FreeScout; Cloudflare Access gate appeared but led to a non-functional page.

- bitnami_credentials format was documented in the template comments, narrowing the search space.
- The set -e + && dash gotcha is not documented anywhere in the existing lessons-learned.
- The service-name mismatch (mysql vs mariadb) was not caught by any local test or dry-run.

Contributing factor 1: Wrong ctlscript.sh service name. The MariaDB readiness loop called ctlscript.sh status mysql. On Bitnami LAMP AMIs, the database service is registered as mariadb. Passing mysql as the service name returns "Unknown service mysql" (exit non-zero). The loop ran all 30 iterations × 5 s = 150 s, consuming most of the time before Bitnami's own init completed. The system allowed this because the loop's if condition silently failed every iteration without any error log output.
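A minimal sketch of a readiness loop that avoids the silent-retry problem described in contributing factor 1. It assumes the control script lives at /opt/bitnami/ctlscript.sh and prints a line containing "running" once the service is up; both are assumptions, as are the names wait_for_mariadb, CTL, MAX_TRIES, and SLEEP_SECS. The fail-fast branch on "Unknown service" is an illustrative addition, not something the patched template is known to contain.

```shell
#!/bin/sh
# Sketch only: CTL, MAX_TRIES, SLEEP_SECS and the "running" match are assumptions.
CTL="${CTL:-/opt/bitnami/ctlscript.sh}"   # assumed path to the Bitnami control script
MAX_TRIES="${MAX_TRIES:-30}"              # 30 x 5 s = 150 s, as in the original loop
SLEEP_SECS="${SLEEP_SECS:-5}"

wait_for_mariadb() {
    i=1
    while [ "$i" -le "$MAX_TRIES" ]; do
        # Capture the output first; "|| true" keeps a non-zero exit status
        # from ending the script when set -e is active.
        status_out="$("$CTL" status mariadb 2>&1 || true)"
        case "$status_out" in
            *"Unknown service"*)
                # Fail fast instead of looping silently on a bad service name.
                echo "FATAL: $status_out" >&2
                return 2
                ;;
            *"not running"*)
                # Service is known but not up yet: fall through and retry.
                ;;
            *running*)
                echo "MariaDB ready after $i check(s)"
                return 0
                ;;
        esac
        echo "MariaDB not ready (check $i/$MAX_TRIES): $status_out" >&2
        sleep "$SLEEP_SECS"
        i=$((i + 1))
    done
    echo "FATAL: MariaDB not ready after $MAX_TRIES checks" >&2
    return 1
}
```

Logging the captured output on every retry means a wrong service name would show up in the bootstrap log on the first iteration rather than after 150 s.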
Contributing factor 2: dash set -e + &&-in-if exits script on grep mismatch — The credentials poll loop used if [ -f file ] && grep -q pattern file; then. Under POSIX sh / dash (which Bitnami's cloud-init uses regardless of the #!/usr/bin/env bash shebang), set -e applies to the right-hand operand of && even inside if conditions. When [ -f file ] succeeded but grep -q returned 1 (placeholder doesn't match "default password is '"), dash applied set -e and exited the script immediately. The for loop body ran zero useful iterations. The system allowed this because the dash-vs-bash behavior difference is underdocumented and the existing lessons-learned already noted that dash ignores the shebang — but did not address the set -e + && interaction.
Contributing factor 3: Insufficient polling window after accounting for MariaDB wait time. The credentials poll was designed for 60 × 5 s = 5 min, starting from script launch. But the MariaDB loop consumed 150 s first, leaving only 2.5 min for credential polling. Bitnami writes credentials ~2–3 min after instance boot, which fell within the 5-min window in isolation, but not after the wasted 150 s. The fix (120 × 5 s = 10 min) provides a margin of ~7 min for credential polling regardless of how long the MariaDB loop takes.
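The loop budgets quoted in the contributing factors can be checked with a few lines of shell arithmetic. This only multiplies the iteration counts and sleep intervals given above; the variable names are illustrative.

```shell
#!/bin/sh
# Worst-case loop budgets, in seconds (iteration count x sleep interval).
MARIADB_WAIT=$((30 * 5))    # MariaDB readiness loop
OLD_POLL=$((60 * 5))        # original credentials poll
NEW_POLL=$((120 * 5))       # patched credentials poll

# Budget left for the patched poll even if a full MariaDB wait is subtracted.
NEW_MARGIN=$((NEW_POLL - MARIADB_WAIT))

echo "MariaDB wait: ${MARIADB_WAIT} s"   # 150 s
echo "Old poll:     ${OLD_POLL} s"       # 300 s
echo "New poll:     ${NEW_POLL} s"       # 600 s
echo "New margin:   ${NEW_MARGIN} s ($((NEW_MARGIN / 60)) min $((NEW_MARGIN % 60)) s)"   # 450 s (7 min 30 s)
```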
- cloud-init status: error and site still serving Bitnami default page after 6th apply attempt.
- A post-apply smoke test should assert that /var/log/freescout-bootstrap.log ends with "completed at" rather than FATAL; add a Cloudflare health check that alerts when tickets.raxx.app/ returns a body containing "Bitnami" or a non-200 status.

Template patch (durable fix for future applies):
- terraform/freescout/templates/user_data.sh.tpl:
  - ctlscript.sh status mysql → ctlscript.sh status mariadb; capture output to variable, then grep the variable (avoids pipeline set -e issues)
  - Replaced if [ -f file ] && grep -q ... with nested if statements + a CREDS_READY flag variable
  - Extended credentials poll from 60 × 5s to 120 × 5s (10 min)
  - Updated FATAL message to say "10 min"
- terraform/README.md lessons-learned: added three new entries documenting the mariadb service name, the dash set -e + && gotcha, and the Bitnami credential write timing
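
The nested-if shape described in the patch could look roughly like the sketch below. It is not the actual template: wait_for_credentials, CREDS_FILE, POLL_TRIES, and POLL_SLEEP are hypothetical names, and only the grep pattern, the CREDS_READY flag, and the 120 × 5 s window come from the notes above.

```shell
#!/bin/sh
# Sketch only: function and variable names other than CREDS_READY are hypothetical.
CREDS_FILE="${CREDS_FILE:-/home/bitnami/bitnami_credentials}"
POLL_TRIES="${POLL_TRIES:-120}"   # 120 x 5 s = 10 min
POLL_SLEEP="${POLL_SLEEP:-5}"

wait_for_credentials() {
    CREDS_READY=0
    i=1
    while [ "$i" -le "$POLL_TRIES" ]; do
        # Each test is the sole condition of its own if statement,
        # so no && list is involved in deciding whether to continue.
        if [ -f "$CREDS_FILE" ]; then
            if grep -q "default password is '" "$CREDS_FILE"; then
                CREDS_READY=1
            fi
        fi
        if [ "$CREDS_READY" -eq 1 ]; then
            break
        fi
        sleep "$POLL_SLEEP"
        i=$((i + 1))
    done

    if [ "$CREDS_READY" -ne 1 ]; then
        # Report the real window so the message cannot drift from the loop bounds.
        echo "FATAL: no password in $CREDS_FILE after $((POLL_TRIES * POLL_SLEEP)) s" >&2
        return 1
    fi
    return 0
}
```

Deriving the timeout in the FATAL message from POLL_TRIES and POLL_SLEEP is a design choice of this sketch; the patch itself hard-codes "10 min".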
Manual recovery on existing instance (required — template fix only applies to future instances):
Operator must run the manual recovery commands from docs/ops/runbooks/freescout.md (Failure mode A section) on the existing raxx-tickets instance. The instance is healthy (MariaDB running, Apache running, all binaries in place); only the FreeScout application layer is missing.
Validation (after manual recovery):
- curl -sI https://tickets.raxx.app/ returns 200 with FreeScout/Laravel headers
- Cloudflare Access login gate prompts on first visit
- cloud-init status — still error (cannot be fixed without a new instance; acceptable since the install will have succeeded manually)
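
The first validation check could be scripted along these lines. This is a sketch under assumptions: classify_response, check_site, and SITE_URL are hypothetical names, and the "Bitnami" body match mirrors the health-check idea in this report rather than any existing tooling.

```shell
#!/bin/sh
# Sketch only: function names and SITE_URL are hypothetical.
SITE_URL="${SITE_URL:-https://tickets.raxx.app/}"

# Decide pass/fail from an HTTP status code ($1) and a response body ($2).
classify_response() {
    status="$1"
    body="$2"
    if [ "$status" != "200" ]; then
        echo "unexpected status: $status"
        return 1
    fi
    case "$body" in
        *Bitnami*)
            echo "Bitnami default page still served"
            return 1
            ;;
    esac
    echo "ok"
    return 0
}

# Fetch the site with curl and classify the result.
check_site() {
    tmp="$(mktemp)"
    # -w '%{http_code}' prints the status code; curl reports 000 if the request fails.
    status="$(curl -s -o "$tmp" -w '%{http_code}' "$SITE_URL" || true)"
    body="$(cat "$tmp")"
    rm -f "$tmp"
    classify_response "$status" "$body"
}
```

Behind Cloudflare Access, an unauthenticated request is typically redirected to the login page, so check_site would likely need to send Access credentials to see the application response; that part is not shown here.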
| # | Action | Owner | Due | Issue |
|---|---|---|---|---|
| 1 | Run manual FreeScout recovery on raxx-tickets instance per runbook Failure mode A | operator | 2026-04-29 | - |
| 2 | After recovery: take a Lightsail manual snapshot (aws lightsail create-instance-snapshot) as bootstrap checkpoint | operator | 2026-04-29 | - |
| 3 | Add post-apply smoke test: SSH to instance after apply, assert bootstrap log ends with "completed at" | sre-agent / operator | 2026-05-06 | TBD |
| 4 | Add Cloudflare health check on tickets.raxx.app that alerts when response body contains "Bitnami" | sre-agent / operator | 2026-05-06 | TBD |
| 5 | Audit all other user-data / cloud-init scripts in the repo for set -e + &&-in-if patterns under dash | sre-agent | 2026-05-06 | TBD |
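
For action item 3, the log assertion itself might be as small as the sketch below. check_bootstrap_log and BOOTSTRAP_LOG are hypothetical names; the log path and the "completed at" / FATAL markers are taken from this report. How the script reaches the instance (SSH user, host, key) is left out because it is not specified here.

```shell
#!/bin/sh
# Sketch only: function and variable names are hypothetical.
BOOTSTRAP_LOG="${BOOTSTRAP_LOG:-/var/log/freescout-bootstrap.log}"

# Pass only if the last line of the bootstrap log contains "completed at".
check_bootstrap_log() {
    log="${1:-$BOOTSTRAP_LOG}"
    if [ ! -r "$log" ]; then
        echo "FAIL: cannot read $log"
        return 2
    fi
    last_line="$(tail -n 1 "$log")"
    case "$last_line" in
        *"completed at"*)
            echo "PASS: $last_line"
            return 0
            ;;
        *FATAL*)
            echo "FAIL: $last_line"
            return 1
            ;;
        *)
            # Covers a script that exited without writing any final message.
            echo "FAIL: bootstrap did not finish; last line: $last_line"
            return 1
            ;;
    esac
}
```

The catch-all branch matters for this incident's failure mode, where the script is reported to have stopped without logging an error at the point of exit.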
- docs/ops/runbooks/freescout.md
- terraform/freescout/templates/user_data.sh.tpl
- terraform/README.md (three new entries)