Raxx · internal docs


Heroku log drain → S3 runbook

System: Heroku log drain → Lambda → S3 (raxx-heroku-logs-prod)
Owner: sre-agent / operator
Last reviewed: 2026-06-19
Issue: #3450 (scope: S3 retention only; no Athena, no Glue)
IaC: terraform/log-drain/
Lambda: lambdas/log-drain-receiver/handler.py
Wire script: scripts/ops/wire-heroku-log-drain.sh


Overview

All Heroku application log output (stdout + stderr) from production apps drains to S3 via Heroku's HTTPS Logplex drain feature.

Heroku Logplex  →  HTTPS drain  →  Lambda Function URL  →  gzip object  →  S3

A small Lambda function (raxx-heroku-log-drain-receiver) receives each POST batch from Heroku, validates a bearer token, gzip-compresses the batch, and writes it to S3 with a time-partitioned key.
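The receive path described above can be sketched roughly as follows. This is a minimal illustration, not the contents of lambdas/log-drain-receiver/handler.py: the function name `receive_batch` and the injected `write` callable (standing in for the S3 put) are assumptions made so the example runs without AWS access.

```python
import gzip
import hmac
from typing import Callable, List


def receive_batch(
    body: bytes,
    presented_token: str,
    expected_token: str,
    write: Callable[[bytes], None],
) -> int:
    """Validate the token, gzip the batch, pass it to `write`; return an HTTP status."""
    # Constant-time comparison avoids leaking token contents through timing.
    if not hmac.compare_digest(presented_token.encode(), expected_token.encode()):
        return 403
    write(gzip.compress(body))
    return 200


# Usage with an in-memory list as a hypothetical stand-in for the S3 write.
stored: List[bytes] = []
status = receive_batch(b"line one\nline two\n", "s3cret", "s3cret", stored.append)
```

Injecting the writer keeps the token check and compression testable on a workstation; in the real function the writer would presumably wrap an S3 put.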

Query/search layer: deferred to a future Splunk setup. S3 objects are immediately accessible via direct S3 GET or the AWS console. No Athena, Glue, or other query service is configured by this stack.

Apps draining

| Heroku app | Surface |
| --- | --- |
| raxx-api-prod | Raptor backend (API) |
| raxx-console-prod | Console ops app |
| raxx-velvet-prod | Velvet credential rotation service |

Key resources

| Resource | Identifier |
| --- | --- |
| S3 bucket | raxx-heroku-logs-prod (us-east-1) |
| S3 key layout | logs/<app>/<YYYY>/<MM>/<DD>/<epoch_ms>-<uuid>.gz |
| Lambda function | raxx-heroku-log-drain-receiver (us-east-1) |
| Lambda IAM role | raxx-heroku-log-drain-receiver |
| SSM path prefix | /raxx/log-drain/ |
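
As an illustration of the S3 key layout listed above, the sketch below builds a key from an app name and a UTC timestamp. The helper name `build_key`, the optional `unique` argument, and the hyphenated UUID form are assumptions; the real handler may derive these parts differently.

```python
import uuid
from datetime import datetime, timezone
from typing import Optional


def build_key(app: str, now: datetime, unique: Optional[str] = None) -> str:
    """Return logs/<app>/<YYYY>/<MM>/<DD>/<epoch_ms>-<uuid>.gz for a timestamp."""
    utc = now.astimezone(timezone.utc)       # partition by UTC date
    epoch_ms = int(utc.timestamp() * 1000)   # millisecond prefix keeps keys sortable
    unique = unique or str(uuid.uuid4())     # assumed: random UUID avoids collisions
    return f"logs/{app}/{utc:%Y/%m/%d}/{epoch_ms}-{unique}.gz"


key = build_key(
    "raxx-api-prod",
    datetime(2026, 6, 19, 12, 0, tzinfo=timezone.utc),
    unique="abc123",
)
```

Because the date components come first, a prefix listing such as logs/raxx-api-prod/2026/06/19/ returns exactly one day of batches.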

SSM parameters

| Name | Type | Contents |
| --- | --- | --- |
| /raxx/log-drain/bucket_name | String | raxx-heroku-logs-prod |
| /raxx/log-drain/bucket_arn | String | arn:aws:s3:::raxx-heroku-logs-prod |
| /raxx/log-drain/drain_token | SecureString | Bearer token validated by Lambda |
| /raxx/log-drain/drain_url | SecureString | Full drain URL (Lambda URL + ?token=) |

Lifecycle

| Age | Tier | Behavior |
| --- | --- | --- |
| 0–90 days | S3 Standard | Directly readable |
| 90–365 days | S3-IA | Directly readable (lower cost, higher per-GET) |
| > 365 days | Deleted | No further retention |

S3-IA chosen over Glacier so logs remain directly accessible without a restore step (compatible with future Splunk ingestion).
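
When estimating where an object of a given age should sit, the lifecycle table can be expressed as a small function. This is a sketch only: the name `storage_tier` and the handling of the exact 90-day and 365-day boundaries are assumptions, and S3 applies lifecycle rules asynchronously, so real transitions can lag.

```python
from datetime import date


def storage_tier(created: date, today: date) -> str:
    """Map an object's age in days to the tier the lifecycle policy implies."""
    age_days = (today - created).days
    if age_days < 90:
        return "STANDARD"
    if age_days <= 365:       # assumed boundary handling; see lead-in
        return "STANDARD_IA"
    return "EXPIRED"


tier_recent = storage_tier(date(2026, 6, 9), date(2026, 6, 19))
tier_old = storage_tier(date(2025, 12, 1), date(2026, 6, 19))
```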


How to tell it's broken


How to diagnose (in order)

  1. Check Heroku drain registration for all three apps:

         heroku drains --app raxx-api-prod --json
         heroku drains --app raxx-console-prod --json
         heroku drains --app raxx-velvet-prod --json

     Expected: each returns at least one drain entry with a URL matching the Lambda Function URL domain (*.lambda-url.us-east-1.on.aws).

  2. Check S3 for recent objects (last 10 minutes worth):

         DATE=$(date -u +%Y/%m/%d)
         aws s3 ls s3://raxx-heroku-logs-prod/logs/raxx-api-prod/${DATE}/ \
           --region us-east-1 | tail -5

     Expected: at least a few objects if the app has had any traffic.

  3. Check Lambda logs for errors:

         aws logs tail /aws/lambda/raxx-heroku-log-drain-receiver \
           --since 30m --region us-east-1

     Look for "drain auth failed" (wrong token), "drain write failed" (S3 issue), or Python tracebacks.

  4. Verify the drain URL is still valid in SSM:

         # Check that the SSM parameter exists (don't print the value)
         aws ssm get-parameter --name /raxx/log-drain/drain_url \
           --region us-east-1 --query Parameter.Name --output text

     Expected: /raxx/log-drain/drain_url

  5. Test the Lambda endpoint directly (the drain URL is read from SSM):

         DRAIN_URL=$(aws ssm get-parameter \
           --name /raxx/log-drain/drain_url \
           --with-decryption \
           --query Parameter.Value \
           --output text \
           --region us-east-1)

         # Send a synthetic syslog batch; the leading number is the message length in bytes
         MSG="<158>1 $(date -u +%Y-%m-%dT%H:%M:%S.000000+00:00) host raxx-api-prod web.1 - synthetic test line"
         curl -s -o /dev/null -w "%{http_code}" \
           -X POST "$DRAIN_URL" \
           -H "Content-Type: application/logplex-1" \
           --data-binary "${#MSG} ${MSG}"

     Expected: 200
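
Heroku delivers HTTPS drain bodies with syslog octet-counting framing: each message is preceded by its length in bytes and a space. The sketch below builds a correctly framed synthetic message for tests like the one in step 5. The helper names `synthetic_line` and `frame_logplex` are hypothetical, and whether the receiver actually validates the length prefix is not stated in this runbook.

```python
from datetime import datetime, timezone


def synthetic_line(app: str, text: str, now: datetime) -> str:
    """Build one syslog-style line in the shape used by the runbook's test batch."""
    stamp = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f+00:00")
    return f"<158>1 {stamp} host {app} web.1 - {text}"


def frame_logplex(message: str) -> bytes:
    """Prefix the message with its UTF-8 byte length and a single space."""
    payload = message.encode("utf-8")
    return str(len(payload)).encode("ascii") + b" " + payload


line = synthetic_line(
    "raxx-api-prod",
    "synthetic test line",
    datetime(2026, 6, 19, 12, 0, tzinfo=timezone.utc),
)
body = frame_logplex(line)
```

Computing the prefix from the payload keeps the frame valid when the test text changes, which a hardcoded length does not.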


Known failure modes

Failure mode A: No drains registered (fresh install / drain removed)

Symptom: heroku drains --app raxx-api-prod returns []. No S3 objects.

Cause: Terraform was applied but wire-heroku-log-drain.sh was not run; or a drain was manually removed.

Fix (idempotent, re-registers drains):

export AWS_ACCESS_KEY_ID=<your-key>
export AWS_SECRET_ACCESS_KEY=<your-secret>
export HEROKU_API_KEY=<your-key>
bash scripts/ops/wire-heroku-log-drain.sh

Verification: heroku drains --app raxx-api-prod returns a drain URL; S3 objects appear within 5 minutes.


Failure mode B: Lambda returns 403 (wrong drain token)

Symptom: curl -s -o /dev/null -w "%{http_code}" "$DRAIN_URL" returns 403. Lambda CloudWatch Logs show drain auth failed: token_present=True.

Cause: The DRAIN_TOKEN environment variable on the Lambda does not match the ?token= value in the drain URL — this can happen if the token was regenerated (terraform taint + apply) without re-registering the Heroku drains.
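
To check this cause offline, the token carried in a drain URL can be compared with the value the Lambda expects. A minimal sketch, assuming the token travels in the `token` query parameter as the SSM table describes; the function name and the example URL are hypothetical.

```python
import hmac
from urllib.parse import parse_qs, urlsplit


def drain_token_matches(drain_url: str, lambda_token: str) -> bool:
    """Return True when the ?token= value in the drain URL equals the Lambda's token."""
    query = parse_qs(urlsplit(drain_url).query)
    presented = query.get("token", [""])[0]   # empty string when the parameter is absent
    return hmac.compare_digest(presented.encode(), lambda_token.encode())


# Hypothetical URL; real values live in SSM and should not be pasted into shells or chat.
example_url = "https://example.lambda-url.us-east-1.on.aws/?token=abc123"
matches = drain_token_matches(example_url, "abc123")
```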

Fix:

# 1. Re-apply Terraform (regenerates token, updates Lambda env var + SSM)
cd terraform/log-drain
terraform taint random_password.drain_token
terraform apply

# 2. Re-register Heroku drains with new URL
bash scripts/ops/wire-heroku-log-drain.sh

# 3. Remove the old drain (different URL) from each app
# List drain IDs with creation time; the older entry is the stale one:
heroku drains --app raxx-api-prod --json \
  | python3 -c "import json,sys; [print(d['id'], d['created_at'], d['url'][:60]) for d in json.load(sys.stdin)]"

# Remove old drain by ID (repeat for raxx-console-prod and raxx-velvet-prod):
heroku drains:remove <old-drain-id> --app raxx-api-prod

Verification: curl returns 200; S3 objects appear within 5 minutes.


Failure mode C: Lambda cannot write to S3 (IAM)

Symptom: Lambda CloudWatch Logs show drain write failed with AccessDenied or NoSuchBucket.

Cause: IAM role policy for raxx-heroku-log-drain-receiver is missing or the bucket was accidentally deleted/renamed.

Diagnosis:

# Confirm Lambda role policy via AWS console or:
aws iam get-role-policy \
  --role-name raxx-heroku-log-drain-receiver \
  --policy-name raxx-heroku-log-drain-receiver-policy \
  --region us-east-1

# Confirm bucket exists:
aws s3 ls s3://raxx-heroku-logs-prod/ --region us-east-1

Fix: Re-apply terraform/log-drain to restore the IAM policy or bucket. If the bucket was deleted, it must be recreated, and the logs it held are lost. A Terraform prevent_destroy = true guard should prevent accidental deletion.


Failure mode D: Lambda cold-start timeouts (rare)

Symptom: Lambda CloudWatch Logs show Task timed out after 15.00 seconds. Heroku may retry the drain batch.

Cause: Extremely large log batch or S3 latency spike. Lambda timeout is 15 seconds (Heroku drain timeout is 10 seconds).

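Fix: No action is needed for an isolated timeout. Lambda does not retry synchronous Function URL invocations, so any retry comes from Heroku re-sending the batch. If timeouts are sustained, increase Lambda memory_size (more memory = more CPU = faster execution) in terraform/log-drain/variables.tf and re-apply.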


Provisioning (first-time / recovery)

Prerequisites

Step 1: Apply Terraform

cd terraform/log-drain
terraform init
terraform plan -out=tfplan
# Review: expect ~14 new resources (S3 bucket + versioning + encryption +
# public-access-block + lifecycle, IAM role + policy, Lambda + function URL,
# SSM params x4, random_password x1)
terraform apply tfplan

Note: The claude-infisical-bootstrap IAM user cannot apply this stack (it lacks s3:CreateBucket, iam:CreateRole, lambda:CreateFunction). Kristerpher must apply from his workstation using his AWS admin credentials.

Step 2: Register Heroku drains

export HEROKU_API_KEY=<prod-api-key-from-vault>
bash scripts/ops/wire-heroku-log-drain.sh

Step 3: Verify

# Drain registration
heroku drains --app raxx-api-prod
heroku drains --app raxx-console-prod
heroku drains --app raxx-velvet-prod

# Wait 5 minutes, then check for S3 objects
aws s3 ls s3://raxx-heroku-logs-prod/logs/ --region us-east-1 --recursive | head -10

Step 4: Smoke-test the Lambda

DRAIN_URL=$(aws ssm get-parameter \
  --name /raxx/log-drain/drain_url \
  --with-decryption \
  --query Parameter.Value \
  --output text \
  --region us-east-1)

# The leading number is the message length in bytes (logplex octet-counting framing)
MSG="<158>1 $(date -u +%Y-%m-%dT%H:%M:%S.000000+00:00) host raxx-api-prod web.1 - smoke test from runbook"
curl -s -o /dev/null -w "HTTP %{http_code}\n" \
  -X POST "$DRAIN_URL" \
  -H "Content-Type: application/logplex-1" \
  --data-binary "${#MSG} ${MSG}"
# Expected: HTTP 200

# Then verify the object landed in S3
aws s3 ls s3://raxx-heroku-logs-prod/logs/raxx-api-prod/$(date -u +%Y/%m/%d)/ \
  --region us-east-1 | tail -3

Retrieving logs from S3

S3 objects are gzip-compressed, one per Heroku logplex batch (typically 10–100 log lines per object). To read them:

# List objects for a date
aws s3 ls s3://raxx-heroku-logs-prod/logs/raxx-api-prod/2026/06/19/ \
  --region us-east-1

# Download and decompress a single object
aws s3 cp s3://raxx-heroku-logs-prod/logs/raxx-api-prod/2026/06/19/<key>.gz - \
  --region us-east-1 | gunzip | head -50

# Search across a day's logs with grep
for key in $(aws s3 ls s3://raxx-heroku-logs-prod/logs/raxx-api-prod/2026/06/19/ \
  --region us-east-1 | awk '{print $4}'); do
  aws s3 cp "s3://raxx-heroku-logs-prod/logs/raxx-api-prod/2026/06/19/$key" - \
    --region us-east-1 | gunzip | grep "status=5" || true
done
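
The same search can be done in Python once the objects have been downloaded. This is a sketch that works on gzip blobs already in memory; the function name `grep_batches` is hypothetical, and fetching the blobs (for example with an S3 client) is left out so the example runs without AWS access.

```python
import gzip
from typing import Iterable, Iterator


def grep_batches(batches: Iterable[bytes], needle: str) -> Iterator[str]:
    """Yield every log line containing `needle` from gzip-compressed batches."""
    for blob in batches:
        # Replace undecodable bytes rather than failing on one bad line.
        text = gzip.decompress(blob).decode("utf-8", errors="replace")
        for line in text.splitlines():
            if needle in line:
                yield line


# Two in-memory batches standing in for downloaded S3 objects.
sample = [
    gzip.compress(b"GET /health status=200\nGET /orders status=502\n"),
    gzip.compress(b"POST /login status=500\n"),
]
hits = list(grep_batches(sample, "status=5"))
```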

Future query layer: Splunk will be configured to ingest from S3 directly (#future). Until then, use the direct S3 grep pattern above or the AWS console S3 Select feature for ad-hoc searches.


Token rotation

If the drain token must be rotated (suspected leak):

cd terraform/log-drain

# 1. Force Terraform to regenerate the random token
terraform taint random_password.drain_token

# 2. Apply: this updates the Lambda env var and SSM params in a single run
terraform apply -target=random_password.drain_token \
                -target=aws_lambda_function.drain_receiver \
                -target=aws_ssm_parameter.drain_token \
                -target=aws_ssm_parameter.drain_url

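# 3. Re-register drains with the new URL
bash scripts/ops/wire-heroku-log-drain.sh

# 4. Remove old drain registrations (any URL other than the current one).
# Old and new drains share the Lambda Function URL host and differ only in
# ?token=, so compare against the full URL stored in SSM. Note that this
# removes every drain whose URL differs, including unrelated ones.
export DRAIN_URL=$(aws ssm get-parameter \
  --name /raxx/log-drain/drain_url \
  --with-decryption \
  --query Parameter.Value \
  --output text \
  --region us-east-1)

for APP in raxx-api-prod raxx-console-prod raxx-velvet-prod; do
  OLD_IDS=$(heroku drains --app "$APP" --json \
    | python3 -c "
import json, os, sys
current_url = os.environ['DRAIN_URL']
for d in json.load(sys.stdin):
    if d.get('url', '') != current_url:
        print(d['id'])
")
  for ID in $OLD_IDS; do
    echo "Removing old drain $ID from $APP"
    heroku drains:remove "$ID" --app "$APP"
  done
done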

Emergency stop — disable all drains

To stop Heroku from sending logs to the endpoint (e.g., endpoint misbehaving):

for APP in raxx-api-prod raxx-console-prod raxx-velvet-prod; do
  IDS=$(heroku drains --app "$APP" --json \
    | python3 -c "import json,sys; [print(d['id']) for d in json.load(sys.stdin)]")
  for ID in $IDS; do
    echo "Removing drain $ID from $APP"
    heroku drains:remove "$ID" --app "$APP"
  done
done

To re-register:

bash scripts/ops/wire-heroku-log-drain.sh

Cost estimate

| Resource | Volume assumption | $/month |
| --- | --- | --- |
| S3 Standard storage (0–90 days) | ~100 MB/day × 90 = 9 GB | ~$0.21 |
| S3-IA storage (90–365 days) | ~3 GB entering/month × 9 mo avg | ~$0.11 |
| S3 PUT requests (drain writes) | ~2K batches/day × 30 = 60K | ~$0.30 |
| Lambda invocations | 2K/day × 30 | $0.00 (free tier) |
| Lambda duration | 128 MB × 0.5s avg × 60K/mo | $0.00 (free tier) |
| CloudWatch Logs | ~10 MB/mo Lambda logs | ~$0.01 |
| Total | estimated, pre-launch scale | ~$0.63/month |

Volume basis: pre-launch. At production scale (DAU in thousands), log volume grows proportionally but remains well under $10/mo unless log throughput is sustained above 1 GB/day.
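
The storage and request lines of the estimate can be re-derived with a few lines of arithmetic when volumes change. This is a sketch, not a billing tool: the unit prices below ($0.023 per GB-month for S3 Standard, $0.005 per 1,000 PUT requests) are assumed us-east-1 list prices and should be checked against current AWS pricing, and the function names are hypothetical.

```python
def monthly_storage_cost(mb_per_day: float, days_retained: int, usd_per_gb_month: float) -> float:
    """Steady-state cost of holding `days_retained` days of logs in one tier."""
    gb_stored = mb_per_day * days_retained / 1000
    return gb_stored * usd_per_gb_month


def monthly_request_cost(requests_per_day: float, usd_per_1000: float, days: int = 30) -> float:
    """Cost of one month of requests billed per thousand."""
    return requests_per_day * days / 1000 * usd_per_1000


# Assumed unit prices; see lead-in.
standard_usd = monthly_storage_cost(mb_per_day=100, days_retained=90, usd_per_gb_month=0.023)
put_usd = monthly_request_cost(requests_per_day=2000, usd_per_1000=0.005)
```

With one object per batch, the request charge grows with batch count rather than byte volume, so it is worth re-checking whenever traffic changes.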


Scope boundary

This stack intentionally covers S3 retention only.

NOT in scope for this stack:
- Athena / Glue (query layer): deferred to future Splunk setup (#future)
- Log-based alerting: see #3448 (paging/alerting vendor)
- Cloudflare Pages log forwarding: separate follow-up card under #3438
- Log format normalization / structured logging: application-level concern


Escalation

Escalate to Kristerpher (ops@raxx.app) when:
- S3 objects are absent for more than 2 hours and Heroku apps are confirmed active
- Lambda error rate exceeds 1% sustained for 30+ minutes
- S3 bucket shows unexpected object deletion or is inaccessible
- A cost alert fires for raxx-heroku-logs-prod exceeding $20/mo