Raxx · internal docs

Gatekeeper — develop to release runbook

System: .github/workflows/gatekeeper-develop-to-release.yml
Owner: sre-agent / operator
Last incident: 2026-06-30 (key leak + 403 push failure; see security note below)
Last reviewed: 2026-06-30


Security note — 2026-06-30 — raxx-ops-bot App private key exposed in logs

Incident: The initial run of cut-release-candidate.yml logged the raxx-ops-bot GitHub App RSA private key in cleartext (base64 PEM lines visible in step output).

Root cause: load-vault-secrets/action.yml used echo "::add-mask::${VALUE}" to register secrets for masking. GitHub Actions' ::add-mask:: command only processes the FIRST line of its argument; subsequent lines of a multiline PEM key appeared unmasked in the step log.

Fix applied (this PR): Replaced the single-echo mask with a per-line loop in load-vault-secrets/action.yml so every line of every fetched secret is individually masked before the value is written to GITHUB_ENV.
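A per-line masking loop of the kind described above might look roughly like this. It is a sketch, not the code in load-vault-secrets/action.yml: the function name mask_multiline is hypothetical, and only the ::add-mask:: workflow command is taken from the incident description.

```shell
# Hypothetical helper: emit one ::add-mask:: command per non-empty line,
# so every line of a multi-line secret (e.g. a PEM key) is registered.
mask_multiline() {
  # printf re-adds a trailing newline so the final line is always read.
  printf '%s\n' "$1" | while IFS= read -r line; do
    if [ -n "$line" ]; then
      printf '::add-mask::%s\n' "$line"
    fi
  done
}
```

The masks must be emitted before the value is written anywhere a later step could print it.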

Operator action required — rotate the raxx-ops-bot App private key: The private key was briefly visible in GitHub Actions step logs. Although Actions logs are restricted to repo collaborators, the key must be treated as compromised and rotated.

# 1. Generate a new private key in the GitHub App settings:
#    GitHub → Settings → Developer settings → GitHub Apps → raxx-ops-bot
#    → Private keys → Generate a private key → download the .pem file

# 2. Update the key in Infisical vault:
#    Vault path: /MooseQuest/raxx-ops-bot/PRIVATE_KEY_PEM
#    Update via the Infisical UI or API with the new PEM content.

# 3. Revoke the old private key in the GitHub App settings.

Until the key is rotated, treat any operations authenticated as raxx-ops-bot during the exposure window as potentially unauthorized.


Model change — 2026-06-30

The gatekeeper was redesigned from a push-triggered model to a tag-triggered model.

Old model (removed): on: push: branches: [develop] — fired on every push to develop, polled ci.yml for up to 30 minutes per push. On a rapid merge-train this fired repeatedly, churned cancelled runs, and one run hit the 30-min timeout and failed.

New model: on: push: tags: ['release-*'] — fires only when a release-* tag is pushed. The tag is the explicit "this develop commit is release-worthy" signal. Fixes accumulate on develop freely and roll up under the next release tag.

The cut-release-candidate workflow (cut-release-candidate.yml) is the standard entry point for creating a release tag.


What the gatekeeper does (new model)

When a release-* tag is pushed to the repository:

  1. Asserts the tagged commit is on develop — refuses to promote a tag that is not an ancestor of origin/develop.
  2. Verifies the tagged SHA's ci.yml run concluded success — short bounded check (10 probes × 30 s = 5 min max). Fails fast if CI is not green.
  3. Idempotency — if the tagged commit is already in release, exits cleanly with no action.
  4. Merges develop → release with --no-ff (merge commit per ADR-0115 §Merge strategy).
  5. Pushes release — the push fires the existing staging deploy workflows automatically (deploy-heroku.yml, deploy-antlers-next-staging.yml, deploy-console.yml, deploy-queue.yml, deploy-velvet.yml). No deploy logic lives in the gatekeeper.
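The bounded CI check in step 2 can be pictured as a small probe loop. This is an illustration only, not the workflow's actual step: the function name bounded_ci_check and the probe's output convention (success, failure, or anything else for "not finished") are assumptions.

```shell
# Hypothetical helper: run a probe command up to max_probes times,
# sleeping interval seconds between attempts.
# $1 = max probes, $2 = interval in seconds, remaining args = probe command.
# Returns 0 on success, 1 on failure, 2 if no probe saw a completed run.
bounded_ci_check() {
  max_probes="$1"; interval="$2"; shift 2
  attempt=1
  while [ "$attempt" -le "$max_probes" ]; do
    verdict="$("$@")"
    case "$verdict" in
      success) return 0 ;;   # CI green: safe to proceed to the merge
      failure) return 1 ;;   # CI red: fail fast, no further probes
    esac
    # Not complete yet: wait, unless this was the last probe.
    if [ "$attempt" -lt "$max_probes" ]; then
      sleep "$interval"
    fi
    attempt=$((attempt + 1))
  done
  return 2
}
```

With the runbook's numbers this would be called as bounded_ci_check 10 30 some_probe, where some_probe is whatever command reports the ci.yml conclusion for the tagged SHA.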

What cut-release-candidate does

cut-release-candidate.yml is a workflow_dispatch workflow that:

  1. Checks out develop HEAD.
  2. Verifies ci.yml is green for that commit (non-polling — fails immediately if not complete).
  3. Computes the release tag name (release-YYYY.MM.DD or release-YYYY.MM.DD-<sha7> if a tag for today already exists).
  4. Creates an annotated tag on develop HEAD and pushes it.
  5. The tag push fires gatekeeper-develop-to-release.yml automatically.
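The tag naming rule in step 3 can be sketched as follows. The function name release_tag_name is hypothetical, and the sketch assumes "a tag for today already exists" means an exact match on release-YYYY.MM.DD; the workflow may apply a different test.

```shell
# Hypothetical helper: pick the release tag name for a date and commit.
# $1 = date as YYYY.MM.DD, $2 = full commit SHA,
# $3 = newline-separated list of tags that already exist.
release_tag_name() {
  base="release-$1"
  sha7="$(printf '%s' "$2" | cut -c1-7)"
  if printf '%s\n' "$3" | grep -qxF -- "$base"; then
    # Today's plain tag is taken: disambiguate with the short SHA.
    printf '%s-%s\n' "$base" "$sha7"
  else
    printf '%s\n' "$base"
  fi
}
```

In a checkout it could be fed with release_tag_name "$(date -u +%Y.%m.%d)" "$(git rev-parse HEAD)" "$(git tag -l 'release-*')".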

Going to production (release → main)

The release → main boundary is governed by a semver tag (v*.*.*). Pushing a semver tag triggers promote-release-to-main.yml, which runs the heavy scan suite and on success merges release → main (→ prod deploy). See docs/ops/runbooks/promote-release-to-main.md (or the workflow header) for the full operator procedure.
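Because a mistyped tag either triggers nothing or triggers the wrong workflow, it can help to check the tag's shape before pushing. A minimal sketch; the function name is_semver_tag is hypothetical, and the regex is deliberately stricter than the v*.*.* glob (it accepts only vMAJOR.MINOR.PATCH with numeric parts), which is an assumption about what the team wants.

```shell
# Hypothetical pre-flight check: accept only tags shaped like vMAJOR.MINOR.PATCH.
is_semver_tag() {
  printf '%s\n' "$1" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'
}
```

For example, is_semver_tag v1.4.0 succeeds, while is_semver_tag release-2026.06.30 and is_semver_tag v1.4 both fail.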


Operator commands

Cut a release candidate and promote develop → staging:

# Standard path — let cut-release-candidate.yml do the work:
gh workflow run cut-release-candidate.yml

# Monitor the tag creation:
gh run list --workflow=cut-release-candidate.yml --limit=3

# Monitor the gatekeeper promotion:
gh run list --workflow=gatekeeper-develop-to-release.yml --limit=3

# Monitor staging deploys:
gh run list --workflow=deploy-heroku.yml --branch=release --limit=5

Ship a prod release (release → main via semver tag):

# Apply a semver tag to the tip of the release branch:
git fetch origin release
git tag -a v1.4.0 origin/release -m "Release v1.4.0"
git push origin v1.4.0

# Monitor the heavy-scan + promotion:
gh run list --workflow=promote-release-to-main.yml --limit=3

# Monitor prod deploys:
gh run list --workflow=deploy-heroku.yml --branch=main --limit=5

Manual gatekeeper trigger (after a freeze or when re-tagging):

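# Re-fire the gatekeeper on the same SHA. A force-push of an unchanged tag is a
# no-op (no push event), so delete the remote tag and push it again:
git fetch origin --tags
git push origin :refs/tags/release-2026.06.30
git push origin refs/tags/release-2026.06.30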

# Or cut a new tag for the same SHA (preferred — creates an audit record):
git tag -a release-2026.06.30-abc1234 <sha> -m "Re-cut RC"
git push origin release-2026.06.30-abc1234

How to tell the gatekeeper is broken


How to diagnose (in order)

  1. Find the relevant gatekeeper run:

     gh run list --workflow=gatekeeper-develop-to-release.yml --limit=5

  2. View the logs for the failed run:

     gh run view <run-id> --log

  3. Identify which step failed:
     - "Load raxx-ops-bot credentials": vault unreachable or INFISICAL_* secrets stale.
     - "Mint raxx-ops-bot installation token": GitHub App credentials invalid or expired.
     - "Assert tagged commit is on develop": the tag was applied to a commit not on develop.
     - "Verify ci.yml is green for tagged commit": ci.yml failed or not yet completed for the tagged SHA.
     - "Merge develop into release": merge conflict or push permission error.

  4. For CI gate failures, find the specific ci.yml run:

     gh run list --workflow=ci.yml --branch=develop --limit=5
     gh run view <run-id> --log-failed


Known failure modes

Failure mode A: ci.yml not green for tagged commit

Symptom: Step "Verify ci.yml is green for tagged commit" exits with ci.yml concluded 'failure' or times out after 5 min.

Cause A1 (failure conclusion): The tagged commit's ci.yml run failed. This is the gate working correctly — a red commit was tagged by mistake.

Fix: Fix the failing check on develop, then cut a new release tag:

# After the fix PR merges to develop:
gh workflow run cut-release-candidate.yml

Cause A2 (timeout — no completed run within 5 min): ci.yml was never triggered for this commit, OR it is queued behind other runs. The 5-min bounded check is designed to be short precisely because we only tag commits we believe are already green.

Fix: Verify ci.yml ran for the tagged SHA:

# ^{commit} peels the annotated tag to the commit SHA that ci.yml ran against:
TAG_SHA=$(git rev-parse 'refs/tags/<tag-name>^{commit}')
gh api "repos/raxx-app/TradeMasterAPI/actions/workflows/ci.yml/runs?head_sha=${TAG_SHA}&per_page=5" \
  --jq '.workflow_runs[] | {id, status, conclusion, html_url}'

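If ci.yml succeeded but the API returned it slowly, re-fire the gatekeeper. A force-push of an unchanged tag does not update the remote ref and so produces no push event; delete the remote tag and push it again:

git push origin :refs/tags/<tag-name>
git push origin refs/tags/<tag-name>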

Verification: gh run list --workflow=gatekeeper-develop-to-release.yml --limit=3
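The status and conclusion fields returned by the API query above map onto causes A1 and A2 roughly as follows. This is a sketch for triage; the function name ci_gate_verdict and the wording of its messages are hypothetical.

```shell
# Hypothetical helper: map a run's status and conclusion to the next action.
# $1 = status (queued, in_progress, completed, or empty if no run was found)
# $2 = conclusion (success, failure, ...; empty until the run completes)
ci_gate_verdict() {
  status="$1"; conclusion="$2"
  if [ -z "$status" ]; then
    echo "missing: ci.yml never ran for this SHA"
  elif [ "$status" != "completed" ]; then
    echo "wait: run is $status"
  elif [ "$conclusion" = "success" ]; then
    echo "green: safe to re-fire the gatekeeper"
  else
    echo "red: fix develop, then cut a new tag"
  fi
}
```

Only the "green" case justifies re-firing the gatekeeper on the same tag; "red" is cause A1, and "wait" or "missing" is cause A2.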


Failure mode B: tagged commit not on develop

Symptom: Step "Assert tagged commit is on develop" exits with "NOT on origin/develop."

Cause: The release-* tag was applied to a commit that is not in the develop branch history — possibly a hotfix branch commit, a detached HEAD, or the wrong SHA.
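To confirm this diagnosis locally, the ancestry check can be reproduced with git merge-base --is-ancestor. A minimal sketch; the function name tag_is_on_branch is hypothetical, and the workflow's own step may be implemented differently.

```shell
# Hypothetical check: is the commit a tag points to reachable from a branch?
# $1 = tag name, $2 = branch ref (e.g. origin/develop).
# Exit status 0 means the tagged commit is an ancestor of the branch tip.
tag_is_on_branch() {
  # ^{commit} peels an annotated tag to the commit it points to.
  git merge-base --is-ancestor "$1^{commit}" "$2"
}
```

After fetching develop and the tags, a non-zero exit status from tag_is_on_branch with the tag name and origin/develop as arguments reproduces the gatekeeper's refusal.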

Fix: Delete the erroneous tag and re-apply it to the correct develop commit:

# Delete the bad tag locally and remotely:
git push origin :refs/tags/<tag-name>
git tag -d <tag-name>

# Apply to the correct develop commit:
git fetch origin develop
git tag -a <tag-name> origin/develop -m "RC <tag-name>"
git push origin <tag-name>

Failure mode C: merge conflict on promote step

Symptom: Step "Merge develop into release" exits with a git merge conflict error.

Cause: Someone pushed directly to release outside this workflow (e.g. a manual hotfix merge), creating a divergence from develop.

This is an escalation. Do NOT force-push to resolve it. The conflict represents a real divergence that requires human triage.

Fix (operator action):

  1. Identify what is on release but not on develop:

     git log origin/develop..origin/release --oneline

  2. Cherry-pick the hotfix commits to develop:

     git fetch origin develop
     git checkout develop
     git cherry-pick <hotfix-sha>
     git push origin develop

  3. Cut a new release tag once develop is clean:

     gh workflow run cut-release-candidate.yml

See ADR-0115 §Emergency hotfix path for the full hotfix procedure.


Failure mode D: YAML syntax error in workflow file (run duration = 0 s)

Symptom: Every gatekeeper run fails instantly — start_time equals end_time, no step logs are produced.

Cause: A change to gatekeeper-develop-to-release.yml introduced a YAML syntax error. Common trigger: a multi-line string inside a run: | literal block where inner lines are at column 0. YAML treats those as the end of the block.

This was the failure mode that triggered the original redesign (recurring workflow failures, 2026-06-30).

Fix: Validate the workflow YAML before pushing:

python3 -c "import yaml; yaml.safe_load(open('.github/workflows/gatekeeper-develop-to-release.yml').read()); print('OK')"
actionlint .github/workflows/gatekeeper-develop-to-release.yml

Failure mode E: vault unreachable

Symptom: Step "Load raxx-ops-bot credentials" fails with a vault auth error.

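Fix: See docs/ops/runbooks/vault-access.md. Once vault is restored, re-fire the gatekeeper by deleting the remote tag and pushing it again (a force-push of an unchanged tag is a no-op and produces no push event):

git push origin :refs/tags/<tag-name>
git push origin refs/tags/<tag-name>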

Failure mode F: release branch protection blocks the push (GH006)

Symptom: Step "Merge develop into release" exits with GH006 (protected branch update failed).

Cause: The release branch protection requires status checks or a reviewer for direct pushes, which conflicts with the bot push from the gatekeeper.

Correct release protection state for the tag-gated model:
  - required_status_checks: null (none)
  - required_pull_request_reviews: null (none)
  - allow_force_pushes: false
  - allow_deletions: false
  - enforce_admins: false

The CI gate lives inside the gatekeeper workflow itself. The heavy PR-gate scans (ZAP, Playwright, etc.) are on the release → main boundary (promote-release-to-main.yml), not here.

Fix:

gh api -X PUT repos/raxx-app/TradeMasterAPI/branches/release/protection \
  --input - <<'JSON'
{
  "required_status_checks": null,
  "required_pull_request_reviews": null,
  "restrictions": null,
  "enforce_admins": false,
  "allow_force_pushes": false,
  "allow_deletions": false
}
JSON

# Verify
gh api repos/raxx-app/TradeMasterAPI/branches/release/protection \
  --jq '{has_checks: (.required_status_checks != null), has_reviews: (.required_pull_request_reviews != null)}'
# → {"has_checks":false,"has_reviews":false}

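# Re-fire the gatekeeper: delete the remote tag and push it again
# (a force-push of an unchanged tag is a no-op and produces no push event)
git push origin :refs/tags/<tag-name>
git push origin refs/tags/<tag-name>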

How to pause auto-promotion (maintenance freeze)

gh workflow disable gatekeeper-develop-to-release.yml

While disabled, no promotions fire regardless of how many release-* tags are pushed. When ready to resume:

gh workflow enable gatekeeper-develop-to-release.yml
# Cut a new tag to trigger promotion of the current develop HEAD:
gh workflow run cut-release-candidate.yml

Failure mode G: push fails with 403 — App token lacks Contents:write

Symptom: Step "Merge develop into release" (or "Create and push annotated release tag" in cut-release-candidate.yml) exits with a 403. The run log shows either:
  - remote: Permission to org/repo.git denied to raxx-ops-bot[bot].
  - The ::warning:: line: App token push failed — raxx-ops-bot App needs Contents:write on this repo.

Cause: The raxx-ops-bot GitHub App installation does not have Contents: Read & Write permission on this repository. The App token mints successfully but the push operation is denied.

Note on GITHUB_TOKEN: GITHUB_TOKEN is NOT a valid fix here. GitHub Actions docs specify that pushes authenticated with GITHUB_TOKEN do not trigger other workflow runs — using it would silently break the downstream deploy trigger chain.

Permanent fix (preferred least-privilege end state):

GitHub → Settings → Developer settings → GitHub Apps → raxx-ops-bot
→ Permissions → Repository permissions → Contents: Read & Write → Save

After saving, re-run the failed workflow:

gh run rerun <failed-run-id>

Required App permissions (full end state):

Permission      Level         Required by
Contents        Read & Write  Push tags + push branch merge commits
Pull requests   Read & Write  release.yml (release-please opens/updates PRs)

Immediate fix (while App permission is pending): Add the RAXX_OPS_BOT_PAT repo secret (classic GitHub PAT with repo scope):

GitHub → Repo settings → Secrets and variables → Actions → New repository secret
Name: RAXX_OPS_BOT_PAT
Value: <classic PAT from the raxx-ops-bot machine user account, scope: repo>

Once set, re-run the failed workflow — the push step falls back to the PAT automatically and emits a ::warning:: annotation. The warning persists until the App is granted Contents:write, at which point the PAT fallback is unused.
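The fallback behaviour described above can be sketched like this. It is an illustration under stated assumptions, not the workflow's actual push step: the function name push_with_fallback, the calling convention (the push command receives the token as its last argument), and the annotation texts are all hypothetical.

```shell
# Hypothetical sketch: try the App token first, then fall back to the PAT.
# $1 = App token, $2 = PAT (may be empty), remaining args = push command,
# which is invoked with the chosen token appended as its last argument.
push_with_fallback() {
  app_token="$1"; pat="$2"; shift 2
  if "$@" "$app_token"; then
    return 0
  fi
  if [ -z "$pat" ]; then
    echo "::error::App token push failed and RAXX_OPS_BOT_PAT is not set"
    return 1
  fi
  # Surface the degraded state so the missing App permission is not forgotten.
  echo "::warning::App token push failed; falling back to RAXX_OPS_BOT_PAT"
  "$@" "$pat"
}
```

Keeping the warning on every fallback run is what makes the PAT a stopgap, not a silent permanent path.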


Manual promote (when gatekeeper is disabled or paused)

git fetch origin develop release
git checkout -B release origin/release
git merge --no-ff origin/develop -m "chore(release): manual promote develop->release $(date -u +%Y.%m.%d)"
git push origin release

Concurrency model

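The gatekeeper uses concurrency: group: gatekeeper-develop-to-release, cancel-in-progress: false. A run that starts while another is in progress therefore waits instead of cancelling it. GitHub Actions keeps at most one pending run per concurrency group, so if several tags are pushed in quick succession, a newer queued run replaces an older queued one.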


Emergency stop

# Cancel an in-flight run:
gh run cancel $(gh run list --workflow=gatekeeper-develop-to-release.yml \
  --status=in_progress --json databaseId --jq '.[0].databaseId')

If the push to release completed before cancellation, staging will have received a deploy. Roll back staging if needed:

gh workflow run deploy-heroku.yml -f environment=staging -f ref=<previous-good-sha>

Escalation

Wake the operator when:
  - Failure mode C (merge conflict): requires human decision on the divergent hotfix.
  - The gatekeeper has failed 3+ times in a row.
  - Vault is unreachable for > 30 min.
  - The release branch has been force-pushed or its protection removed.
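The "3+ times in a row" condition can be checked mechanically. A minimal sketch; the function name consecutive_failures is hypothetical, and it assumes one conclusion per line on stdin with the newest run first.

```shell
# Hypothetical helper: count how many of the most recent runs failed in a row.
# Reads one conclusion per line from stdin, newest run first, and stops at
# the first line that is not "failure".
consecutive_failures() {
  count=0
  while IFS= read -r conclusion; do
    [ "$conclusion" = "failure" ] || break
    count=$((count + 1))
  done
  echo "$count"
}
```

One way to feed it is gh run list --workflow=gatekeeper-develop-to-release.yml --limit=10 --json conclusion --jq '.[].conclusion' piped into consecutive_failures; a result of 3 or more meets the escalation threshold.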



How to tell the gatekeeper is broken


How to diagnose (in order)

  1. Find the relevant gatekeeper run: gh run list --workflow=gatekeeper-develop-to-release.yml --limit=5

  2. View the logs for the failed run: gh run view <run-id> --log

  3. Identify which step failed:
     - "Load raxx-ops-bot credentials": vault unreachable or INFISICAL_* secrets stale.
     - "Mint raxx-ops-bot installation token": GitHub App credentials invalid or expired.
     - "Wait for CI — develop": ci.yml failed or timed out (see ci.yml run for details).
     - "Push release tag": tag collision or push permission error.
     - "Promote develop → release": merge conflict or push permission error.

  4. For CI gate failures, find the specific ci.yml run:

     gh run list --workflow=ci.yml --branch=develop --limit=5
     gh run view <run-id> --log-failed

  5. For push permission errors, verify the raxx-ops-bot GitHub App has contents: write on this repository.


Known failure modes

Failure mode A: CI gate timeout (30 min)

Symptom: Step "Wait for CI — develop" exits with "Timed out after 1800s."

Cause: ci.yml is taking more than 30 min, OR runner queue pressure delayed the ci.yml run, OR the ci.yml run was never created (possible if workflow file is broken).

Fix:

# Confirm ci.yml ran for the SHA
SHA=$(git rev-parse origin/develop)
gh api "repos/raxx-app/TradeMasterAPI/actions/workflows/ci.yml/runs?head_sha=${SHA}&per_page=5" \
  --jq '.workflow_runs[] | {id, status, conclusion, html_url}'

If ci.yml ran and succeeded, re-trigger the gatekeeper manually:

gh workflow run gatekeeper-develop-to-release.yml --ref develop

If ci.yml is genuinely slow (> 30 min), increase MAX_WAIT in the workflow (file a type:reliability ticket first).

Verification: gh run list --workflow=gatekeeper-develop-to-release.yml --limit=3


Failure mode B: CI gate fails (develop CI red)

Symptom: Step "Wait for CI — develop" exits with "CI — develop failure for <sha>."

Cause: This is the gate working as intended. A merge landed on develop with a failing check.

Fix: Fix the failing check on develop (not on a feature branch; the merge is already on develop). Options:
  - If the failure is a flaky test: manually re-run the ci.yml run, then re-run the gatekeeper.
  - If the failure is a real bug: open a PR targeting develop with the fix; the gatekeeper fires again after that merge.

Do NOT bypass the gate by manually merging develop → release. The gate exists to keep staging green.

Verification: Staging should not receive a deploy until the fix lands on develop and the gatekeeper passes.


Failure mode C: tag collision

Symptom: Step "Push release tag" exits with "Tag release-YYYY.MM.DD-sha7 already exists but points to a different SHA."

Cause: A tag with this name was manually created on a different commit. Should not occur in normal operation.

Fix:

# Identify the conflicting tag
git fetch origin --tags
git tag -v release-YYYY.MM.DD-<sha7>

# Delete the conflicting tag (operator action — confirm correct SHA first)
git push origin :refs/tags/release-YYYY.MM.DD-<sha7>

# Re-run the gatekeeper
gh workflow run gatekeeper-develop-to-release.yml --ref develop

Verification: git ls-remote origin 'refs/tags/release-*'


Failure mode D: merge conflict on promote step

Symptom: Step "Promote develop → release" exits with a git merge conflict error.

Cause: Someone pushed directly to release outside this workflow (e.g. a manual hotfix merge), creating a divergence from develop.

This is an escalation. Do NOT force-push to resolve it. The conflict represents a real divergence that requires human triage.

Fix (operator action):

  1. Identify what is on release but not on develop:

     git log origin/develop..origin/release --oneline

  2. Cherry-pick the hotfix commits to develop:

     git checkout develop
     git cherry-pick <hotfix-sha>
     git push origin develop

  3. The gatekeeper fires again after the cherry-pick merge, and the next merge will succeed.

See ADR-0115 §Emergency hotfix path for the full hotfix procedure.


Failure mode F: YAML syntax error in workflow file (run duration = 0s)

Symptom: Every gatekeeper run fails instantly — start_time equals end_time, no step logs are produced. GitHub Actions may report "This workflow is invalid" or "ScannerError: while scanning a simple key."

Cause: A change to gatekeeper-develop-to-release.yml introduced a YAML syntax error. Common trigger: a multi-line string inside a run: | literal block where inner lines are at column 0. YAML treats those as the end of the block, and any : in subsequent lines is parsed as a YAML key, producing a ScannerError. This class of error is invisible to GitHub's PR preview but fails at runner startup.

This is the failure mode that occurred on 2026-06-30 (16-day gap in promotions).

Fix: Validate the workflow YAML before pushing:

python3 -c "import yaml; yaml.safe_load(open('.github/workflows/gatekeeper-develop-to-release.yml').read()); print('OK')"

If actionlint is available (CI check — see ci-hygiene.md):

actionlint .github/workflows/gatekeeper-develop-to-release.yml

Fix the indentation/quoting error and push a corrected commit to develop.

Verification: A new gatekeeper run appears for the fix commit and progresses past the YAML parse phase (at least one step log is produced).


Failure mode E: vault unreachable

Symptom: Step "Load raxx-ops-bot credentials" fails with a vault auth error.

Cause: Infisical vault is down, or the CF Access service token for CI has expired.

Fix: See docs/ops/runbooks/vault-access.md. Once vault is restored, re-run the gatekeeper:

gh workflow run gatekeeper-develop-to-release.yml --ref develop

Failure mode G: release branch protection blocks the push (GH006)

Symptom: Step "Promote develop → release" exits with:

remote: error: GH006: Protected branch update failed for refs/heads/release
remote: error: Required status checks are expected.

or

remote: error: At least 1 approving review is required ...

Cause: The release branch protection was configured with required status checks and/or a required PR review — settings appropriate for a PR-based gate but NOT compatible with the gatekeeper's direct-push model. The gatekeeper checks develop CI itself before pushing; requiring additional checks or reviews on release directly blocks the bot push and defeats the automation.

Root cause documented: 2026-06-30 — cutover (ADR-0115 Phase 6) applied the PR-gate spec from the ADR to release, which required 9 status checks + 1 reviewer. This contradicted the gatekeeper model where develop CI is the gate.

Correct release protection state:
  - required_status_checks: null (none)
  - required_pull_request_reviews: null (none)
  - allow_force_pushes: false (protected)
  - allow_deletions: false (protected)
  - enforce_admins: false

The gate on develop CI lives INSIDE the gatekeeper workflow (step "Wait for CI — develop to complete"). Heavy scans (ZAP, Queue Docker, Playwright e2e, etc.) gate the release → main boundary (prod promotion), NOT the develop → release boundary.

Fix:

# Verify current state (should have no required_status_checks or required_pull_request_reviews)
gh api repos/raxx-app/TradeMasterAPI/branches/release/protection

# If either field is present, clear them:
gh api -X PUT repos/raxx-app/TradeMasterAPI/branches/release/protection \
  --input - <<'JSON'
{
  "required_status_checks": null,
  "required_pull_request_reviews": null,
  "restrictions": null,
  "enforce_admins": false,
  "allow_force_pushes": false,
  "allow_deletions": false
}
JSON

# Verify (response should NOT contain required_status_checks or required_pull_request_reviews keys)
gh api repos/raxx-app/TradeMasterAPI/branches/release/protection

# Re-run the last failed gatekeeper run
FAILED_RUN=$(gh run list --workflow=gatekeeper-develop-to-release.yml \
  -R raxx-app/TradeMasterAPI --json databaseId,conclusion \
  --jq '[.[] | select(.conclusion == "failure")] | .[0].databaseId')
gh run rerun "$FAILED_RUN" -R raxx-app/TradeMasterAPI

Note: The gatekeeper does NOT have a workflow_dispatch trigger; use gh run rerun to re-trigger without a new develop push.

Note: After gh run rerun, the rerun's push to release may not fire staging deploy webhooks (GitHub suppresses push events for workflow reruns). The staging deploys will fire correctly on the NEXT normal develop push that triggers the gatekeeper. To force staging deploys immediately after a fix:

# Dispatch each staging deploy workflow manually for the current release HEAD
gh workflow run deploy-heroku.yml -R raxx-app/TradeMasterAPI \
  -f environment=staging -f ref=release

Verification:

# Confirm gatekeeper passed
gh run view "$FAILED_RUN" --json conclusion
# → {"conclusion":"success"}

# Confirm release advanced
gh api repos/raxx-app/TradeMasterAPI/git/ref/heads/release --jq '.object.sha'

# Confirm protection is still correct (no required checks crept back in)
gh api repos/raxx-app/TradeMasterAPI/branches/release/protection \
  --jq '{has_checks: (.required_status_checks != null), has_reviews: (.required_pull_request_reviews != null)}'
# → {"has_checks":false,"has_reviews":false}

How to pause auto-promotion (maintenance freeze)

Option 1 — Disable the workflow (preferred):

GitHub → Actions → Workflows → "Gatekeeper — develop to release" → ... → Disable workflow

Or via CLI:

gh workflow disable gatekeeper-develop-to-release.yml

While disabled, no promotions fire. When ready to resume:

gh workflow enable gatekeeper-develop-to-release.yml

After re-enabling, trigger a manual run to promote any commits that landed during the freeze:

gh workflow run gatekeeper-develop-to-release.yml --ref develop

Option 2 — Block the CI gate (temporary hold on a specific commit): If you want to hold promotion for a specific commit without disabling all future promotions, let the gatekeeper's CI gate naturally block until you're ready. The ci.yml run on that commit will time out after 30 min. For a longer hold, disable the workflow (Option 1).

Option 3 — Branch protection on release (emergency): Set release branch protection to require a human reviewer for direct pushes. This blocks the bot push from the gatekeeper without disabling the workflow, and allows a manual merge when ready.


Manual promote (when gatekeeper is disabled or paused)

To manually promote develop → release when the gatekeeper is disabled:

git fetch origin develop release
git checkout -B release origin/release
git merge --no-ff origin/develop -m "chore(release): manual promote develop → release $(date -u +%Y.%m.%d)"
git push origin release

This pushes to release, which fires the staging deploy workflows automatically.


Concurrency model

The gatekeeper uses concurrency: group: gatekeeper-develop, cancel-in-progress: false. This means a run that starts while another is in progress waits for it to finish instead of cancelling it.


Emergency stop

To stop a promotion that is currently in progress:

# Cancel the in-flight run
gh run cancel $(gh run list --workflow=gatekeeper-develop-to-release.yml --status=in_progress --json databaseId --jq '.[0].databaseId')

The cancellation will leave release in whatever state it was in before the cancelled merge push (git push is atomic). If the push completed before cancellation, staging will still deploy. In that case, roll back staging via the staging deploy workflow dispatch:

gh workflow run deploy-heroku.yml -f environment=staging -f ref=<previous-good-sha>

Escalation

Wake the operator when:
  - Failure mode D (merge conflict): requires human decision on the divergent hotfix.
  - The gatekeeper has failed 3+ times in a row on the same SHA.
  - Vault is unreachable for > 30 min.
  - The release branch has been force-pushed or its protection has been removed.

