RCA — FreeScout cloud-init FATAL: could not extract MySQL root password
Incident ID: 2026-04-29-freescout-cloud-init-fatal
Date: 2026-04-29
Severity: SEV-2
Duration: ~2.5 hours elapsed (first apply attempt → root cause identified + template patched; install not yet completed on existing instance — manual recovery pending operator execution)
Blast radius: FreeScout ticketing system (tickets.raxx.app) was never successfully bootstrapped; site served Bitnami default page. No customer data affected (pre-launch system). Operator was unable to use the ticketing system.
Author: sre-agent
Summary
Six consecutive Terraform apply attempts failed to bootstrap FreeScout on a Lightsail LAMP instance. The bootstrap script (cloud-init user-data) exited with a FATAL error claiming it could not read the MariaDB root password from /home/bitnami/bitnami_credentials. Two compounding bugs caused the failure: the MariaDB readiness wait loop used the wrong service name (mysql instead of mariadb), wasting 150 s; and the credentials poll loop used a POSIX sh &&-in-if pattern that triggers set -e under dash, causing the script to exit silently on the first iteration rather than polling for 5 minutes. The script exited 17 seconds before Bitnami wrote the actual credentials. The fix is a template patch to user_data.sh.tpl (two corrections + extended poll window) and a manual recovery run on the existing instance.
Timeline (all times UTC)
- 05:09:10 — Bootstrap script started on 6th apply attempt (cloud-init user-data began)
- 05:09:10–05:11:40 — MariaDB readiness loop ran all 30 iterations × 5s (150 s); `ctlscript.sh status mysql` returned "Unknown service mysql" every time; the loop never broke
- 05:11:40 — MariaDB loop exhausted; credentials poll loop started
- 05:11:40 — Iteration 1: `[ -f credentials_file ]` succeeds; `grep -q "default password is '"` returns 1 (file had placeholder). dash `set -e` triggered on `&&` right-hand operand; script exited with no error message at this point in the loop
- 05:11:41 — Script reached FATAL branch (MYSQL_ROOT_PASS was empty, never extracted); `cat` of credentials file wrote placeholder to log
- 05:11:41 — `/var/log/freescout-bootstrap.log` last write (bootstrap log mtime confirmed)
- 05:11:58 — Bitnami wrote real credentials to `/home/bitnami/bitnami_credentials` (file mtime confirmed; 17 s after script exited)
- 05:11:59 — cloud-init finished with status `error`
- ~2026-04-29 — sre-agent diagnosed root causes via SSH inspection; patched template; wrote runbook and RCA
Impact
- Users affected: 0 (pre-launch system; Kristerpher is the only user)
- User-visible symptoms: `https://tickets.raxx.app/` served Bitnami default page instead of FreeScout; Cloudflare Access gate appeared but led to a non-functional page
- Data integrity: ok (no FreeScout data ever written)
- Revenue / billing: ok
What went well
- Static IP and Cloudflare DNS/Access resources were applied correctly by Terraform; the infrastructure layer was healthy
- The existing instance is intact — no data was lost, no disk is corrupt
- The bootstrap log captured the FATAL with the placeholder file contents, making the timing diagnosable
- The `bitnami_credentials` format was documented in the template comments, narrowing the search space
What didn't go well
- The same bootstrap failure occurred six times with no change to the root cause between attempts — each apply recreated the instance with the same broken script
- Two distinct bugs combined to cause the failure; fixing only one would not have resolved the incident
- The `set -e` + `&&` dash gotcha is not documented anywhere in the existing lessons-learned
- The wrong service name (`mysql` vs `mariadb`) was not caught by any local test or dry-run
- 150 s of wasted sleep per apply attempt was not visible in the Terraform output; the failure looked identical regardless of whether MariaDB came up or not
Root cause analysis
- Contributing factor 1: Wrong `ctlscript.sh` service name. The MariaDB readiness loop called `ctlscript.sh status mysql`. On Bitnami LAMP AMIs, the database service is registered as `mariadb`. The `mysql` subcommand returns "Unknown service mysql" (exit non-zero). The loop ran all 30 iterations × 5s = 150 s, consuming most of the time before Bitnami's own init completed. The system allowed this because the loop's `if` condition silently failed every iteration without any error log output.
- Contributing factor 2: dash `set -e` + `&&`-in-`if` exits script on grep mismatch. The credentials poll loop used `if [ -f file ] && grep -q pattern file; then`. Under POSIX sh / dash (which Bitnami's cloud-init uses regardless of the `#!/usr/bin/env bash` shebang), `set -e` applies to the right-hand operand of `&&` even inside `if` conditions. When `[ -f file ]` succeeded but `grep -q` returned 1 (placeholder doesn't match `"default password is '"`), dash applied `set -e` and exited the script immediately. The for loop body ran zero useful iterations. The system allowed this because the dash-vs-bash behavior difference is underdocumented and the existing lessons-learned already noted that dash ignores the shebang, but did not address the `set -e` + `&&` interaction.
- Contributing factor 3: Insufficient polling window after accounting for MariaDB wait time. The credentials poll was designed for 60 × 5s = 5 min, starting from script launch. But the MariaDB loop consumed 150 s first, leaving only 2.5 min of that budget for credential polling. Bitnami writes credentials ~2–3 min after instance boot, which fell within the 5-min window in isolation, but not after the wasted 150 s. The fix (120 × 5s = 10 min) provides a margin of ~7 min for credential polling regardless of how long the MariaDB loop takes.
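Contributing factor 2 depends on how the interpreter treats `set -e` inside an `if` condition. The POSIX shell specification says errexit is ignored while the condition of an `if` is evaluated, so it may be worth confirming the behavior on the actual image before relying on the diagnosis. The following is a minimal probe, not part of the template; the function name `probe_errexit_in_if` is hypothetical, and `/dev/null` stands in for the credentials file so that the left-hand test succeeds and the `grep` finds no match.

```shell
#!/bin/sh
# probe_errexit_in_if SHELL_BINARY
# Runs a tiny `set -e` script under the given shell. The script uses the same
# shape as the original poll condition: the left-hand test succeeds and the
# right-hand grep returns 1.
# Prints "survived" if the script reached its last line, "exited" if it did
# not, and "missing" if the shell binary cannot be found.
probe_errexit_in_if() {
    shell_bin="$1"
    if ! command -v "$shell_bin" >/dev/null 2>&1; then
        echo "missing"
        return 1
    fi
    # `|| true` keeps a failing probe from aborting a caller that uses set -e.
    probe_out=$("$shell_bin" -c '
        set -e
        if [ -e /dev/null ] && grep -q "default password is" /dev/null; then
            echo matched
        fi
        echo reached-end
    ' 2>/dev/null) || true
    case "$probe_out" in
        *reached-end*) echo "survived" ;;
        *)             echo "exited" ;;
    esac
}

# Example (run on the instance itself):
#   probe_errexit_in_if dash
#   probe_errexit_in_if bash
```

A result of `exited` would support the mechanism described above; `survived` would suggest looking for another exit path, such as a failing command substitution in an assignment.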
Detection
- What alerted us: Operator reported `cloud-init status: error` and site still serving Bitnami default page after 6th apply attempt
- How long between cause and detection: ~2.5 hours (6 apply cycles, each requiring a recreate + wait)
- How to detect faster next time: add a post-apply smoke test that SSHes to the instance after cloud-init completes and verifies `/var/log/freescout-bootstrap.log` ends with `completed at` rather than `FATAL`; add a Cloudflare health check that alerts when `tickets.raxx.app/` returns a body containing "Bitnami" or a non-200 status
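The two checks proposed above could be sketched roughly as follows. This is an illustration under assumptions, not the smoke test that will be built: the function names `bootstrap_log_ok` and `body_is_bitnami_default` are hypothetical, the SSH user and host in the usage comment are placeholders, and the exact wording of the success line is taken only from the `completed at` marker mentioned above.

```shell
#!/bin/sh
# bootstrap_log_ok LOGFILE
# Returns 0 if the last non-empty line of LOGFILE contains "completed at",
# 1 if it does not (for example when the log ends with a FATAL line),
# and 2 if the file cannot be read.
bootstrap_log_ok() {
    log_file="$1"
    [ -r "$log_file" ] || return 2
    # Drop blank lines so a trailing newline does not hide the real last line.
    last_line=$(grep -v '^[[:space:]]*$' "$log_file" | tail -n 1)
    case "$last_line" in
        *"completed at"*) return 0 ;;
        *)                return 1 ;;
    esac
}

# body_is_bitnami_default
# Reads an HTTP response body on stdin; returns 0 if it mentions "Bitnami".
body_is_bitnami_default() {
    grep -q "Bitnami"
}

# Example usage (placeholders: SSH_USER, INSTANCE_IP):
#   ssh "$SSH_USER@$INSTANCE_IP" 'cat /var/log/freescout-bootstrap.log' > /tmp/bootstrap.log
#   bootstrap_log_ok /tmp/bootstrap.log || echo "bootstrap did not complete"
#   curl -s https://tickets.raxx.app/ | body_is_bitnami_default && echo "still serving Bitnami default page"
```

Copying the log locally before checking it keeps the assertion logic independent of the SSH session, so the same function can run in CI against a saved log.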
Resolution
Template patch (durable fix for future applies):
- terraform/freescout/templates/user_data.sh.tpl:
  - `ctlscript.sh status mysql` → `ctlscript.sh status mariadb`; capture output to variable, then grep the variable (avoids pipeline `set -e` issues)
  - Replaced `if [ -f file ] && grep -q ...` with nested `if` statements + a `CREDS_READY` flag variable
  - Extended credentials poll from 60 × 5s to 120 × 5s (10 min)
  - Updated FATAL message to say "10 min"
- terraform/README.md lessons-learned: added three new entries documenting the mariadb service name, the dash set -e + && gotcha, and the Bitnami credential write timing
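The shape of the patched loops can be sketched as below. This is not the contents of `user_data.sh.tpl`; it only illustrates the nested-`if` plus `CREDS_READY` flag approach and the capture-then-grep status check. The function names, the `CREDS_PATTERN` variable, and the assumption that a healthy service reports a line containing "already running" are all illustrative and should be checked against the real template and the real `ctlscript.sh` output.

```shell
#!/bin/sh
# Pattern from the RCA; variable name is hypothetical.
CREDS_PATTERN="default password is '"

# wait_for_credentials FILE ATTEMPTS INTERVAL_SECONDS
# Polls until FILE exists and contains CREDS_PATTERN.
# Sets CREDS_READY=1 on success and CREDS_READY=0 on timeout. It always
# returns 0, so the caller decides what to do by reading the flag rather
# than by relying on an exit status under `set -e`.
wait_for_credentials() {
    creds_file="$1"
    attempts="$2"
    interval="$3"
    CREDS_READY=0
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Nested ifs: each test is the sole command of its own condition.
        if [ -f "$creds_file" ]; then
            if grep -q "$CREDS_PATTERN" "$creds_file"; then
                CREDS_READY=1
                break
            fi
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    return 0
}

# status_says_running STATUS_TEXT
# Greps previously captured status output instead of piping the command.
# Assumption: a running service is reported as "<name> already running".
status_says_running() {
    case "$1" in
        *"already running"*) return 0 ;;
        *)                   return 1 ;;
    esac
}

# Example usage inside a bootstrap script (paths as named in this RCA):
#   status_out=$(/opt/bitnami/ctlscript.sh status mariadb 2>&1 || true)
#   status_says_running "$status_out" && echo "MariaDB is up"
#   wait_for_credentials /home/bitnami/bitnami_credentials 120 5
#   [ "$CREDS_READY" -eq 1 ] || { echo "FATAL: credentials not written after 10 min"; exit 1; }
```

Keeping the timeout decision in a flag means a `grep` mismatch can only ever cause another loop iteration, whichever shell runs the script.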
Manual recovery on existing instance (required — template fix only applies to future instances):
Operator must run the manual recovery commands from docs/ops/runbooks/freescout.md (Failure mode A section) on the existing raxx-tickets instance. The instance is healthy (MariaDB running, Apache running, all binaries in place); only the FreeScout application layer is missing.
Validation (after manual recovery):
- curl -sI https://tickets.raxx.app/ returns 200 with FreeScout/Laravel headers
- Cloudflare Access login gate prompts on first visit
- cloud-init status — still error (cannot be fixed without a new instance; acceptable since the install will have succeeded manually)
Action items
| # | Action | Owner | Due | Issue |
|---|---|---|---|---|
| 1 | Run manual FreeScout recovery on raxx-tickets instance per runbook Failure mode A | operator | 2026-04-29 | — |
| 2 | After recovery: take a Lightsail manual snapshot (aws lightsail create-instance-snapshot) as bootstrap checkpoint | operator | 2026-04-29 | — |
| 3 | Add post-apply smoke test: SSH to instance after apply, assert bootstrap log ends with "completed at" | sre-agent / operator | 2026-05-06 | TBD |
| 4 | Add Cloudflare health check on tickets.raxx.app that alerts when response body contains "Bitnami" | sre-agent / operator | 2026-05-06 | TBD |
| 5 | Audit all other user-data / cloud-init scripts in the repo for set -e + &&-in-if patterns under dash | sre-agent | 2026-05-06 | TBD |
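For action item 5, a first pass over the repo could be a simple text search like the one below. It is a rough sketch under assumptions: the function name `find_and_in_if` is hypothetical, the file extensions searched (`*.sh`, `*.tpl`) are a guess at where user-data scripts live, and a line-based pattern will miss conditions that span several lines, so the results are a starting list for manual review rather than a complete audit.

```shell
#!/bin/sh
# find_and_in_if DIR
# Prints file:line:text for every line under DIR (in *.sh and *.tpl files)
# where an `if` or `elif` condition contains `&&`.
# Always returns 0, including when nothing matches.
find_and_in_if() {
    search_dir="$1"
    find "$search_dir" -type f \( -name '*.sh' -o -name '*.tpl' \) \
        -exec grep -nHE '^[[:space:]]*(if|elif)[[:space:]].*&&' {} + || true
}

# Example usage from the repo root:
#   find_and_in_if terraform/
```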
References
- Runbook: `docs/ops/runbooks/freescout.md`
- Template fix: `terraform/freescout/templates/user_data.sh.tpl`
- Lessons learned: `terraform/README.md` (three new entries)
- Related: no prior incidents (first bootstrap attempt for this system)