Heroku log drain → S3 runbook
System: Heroku log drain → Lambda → S3 (raxx-heroku-logs-prod)
Owner: sre-agent / operator
Last reviewed: 2026-06-19
Issue: #3450 (scope: S3 retention only — no Athena, no Glue)
IaC: terraform/log-drain/
Lambda: lambdas/log-drain-receiver/handler.py
Wire script: scripts/ops/wire-heroku-log-drain.sh
Overview
All Heroku application log output (stdout + stderr) from production apps drains to S3 via Heroku's HTTPS Logplex drain feature.
Heroku Logplex → HTTPS drain → Lambda Function URL → gzip object → S3
A small Lambda function (raxx-heroku-log-drain-receiver) receives each
POST batch from Heroku, validates a bearer token, gzip-compresses the batch,
and writes it to S3 with a time-partitioned key.
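The receive path described above can be sketched roughly as follows. This is a minimal illustration, not the contents of handler.py: the function name `handle_batch` and its signature are hypothetical, and the S3 write is left to the caller.

```python
import gzip
import hmac


def handle_batch(body: bytes, presented_token: str, expected_token: str):
    """Return (status_code, gzip_payload_or_None) for one drain POST.

    Hypothetical helper: the real handler.py may be structured differently.
    """
    # Constant-time comparison avoids leaking token contents through timing.
    if not hmac.compare_digest(presented_token.encode(), expected_token.encode()):
        return 403, None
    # Compress the raw batch; the caller would then write it to S3.
    return 200, gzip.compress(body)


status, payload = handle_batch(b"example log batch\n", "s3cret", "s3cret")
print(status, len(payload))
```

Keeping the token check and compression in a pure function like this makes the behavior easy to unit-test without AWS credentials.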
Query/search layer: deferred to a future Splunk setup. S3 objects are immediately accessible via direct S3 GET or the AWS console. No Athena, Glue, or other query service is configured by this stack.
Apps draining
| Heroku app | Surface |
|---|---|
| raxx-api-prod | Raptor backend (API) |
| raxx-console-prod | Console ops app |
| raxx-velvet-prod | Velvet credential rotation service |
Key resources
| Resource | Identifier |
|---|---|
| S3 bucket | raxx-heroku-logs-prod (us-east-1) |
| S3 key layout | logs/<app>/<YYYY>/<MM>/<DD>/<epoch_ms>-<uuid>.gz |
| Lambda function | raxx-heroku-log-drain-receiver (us-east-1) |
| Lambda IAM role | raxx-heroku-log-drain-receiver |
| SSM path prefix | /raxx/log-drain/ |
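To make the key layout concrete, here is a small sketch that builds a key of the shape shown in the table. The helper name `build_key` is hypothetical, and the use of UTC for the date partition is an assumption; the actual handler may differ.

```python
import uuid
from datetime import datetime, timezone


def build_key(app: str, now: datetime) -> str:
    """Build logs/<app>/<YYYY>/<MM>/<DD>/<epoch_ms>-<uuid>.gz (UTC date assumed)."""
    now = now.astimezone(timezone.utc)
    epoch_ms = int(now.timestamp() * 1000)
    # A random UUID keeps keys unique when two batches land in the same millisecond.
    return f"logs/{app}/{now:%Y/%m/%d}/{epoch_ms}-{uuid.uuid4()}.gz"


key = build_key("raxx-api-prod", datetime(2026, 6, 19, 12, 0, tzinfo=timezone.utc))
print(key)
```

Because the date is the leading partition after the app name, a single `aws s3 ls` on a day prefix lists every batch for that app and day.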
SSM parameters
| Name | Type | Contents |
|---|---|---|
| /raxx/log-drain/bucket_name | String | raxx-heroku-logs-prod |
| /raxx/log-drain/bucket_arn | String | arn:aws:s3:::raxx-heroku-logs-prod |
| /raxx/log-drain/drain_token | SecureString | Bearer token validated by Lambda |
| /raxx/log-drain/drain_url | SecureString | Full drain URL (Lambda URL + ?token=) |
Lifecycle
| Age | Tier | Behavior |
|---|---|---|
| 0–90 days | S3 Standard | Directly readable |
| 90–365 days | S3-IA | Directly readable (lower cost, higher per-GET) |
| > 365 days | Deleted | No further retention |
S3-IA chosen over Glacier so logs remain directly accessible without a restore step (compatible with future Splunk ingestion).
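As a quick aid when deciding where an object of a given age should live, the lifecycle table can be expressed as a function. This is a sketch under assumptions: the function name is hypothetical, and the boundary handling (transition at day 90, deletion after day 365) is read off the table rather than from the Terraform rule itself.

```python
def tier_for_age(age_days: int) -> str:
    """Map object age in days to the tier listed in the lifecycle table.

    Assumes the transition applies from day 90 and deletion after day 365;
    the authoritative boundaries live in the Terraform lifecycle rule.
    """
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days < 90:
        return "STANDARD"
    if age_days <= 365:
        return "STANDARD_IA"
    return "DELETED"


for age in (0, 90, 400):
    print(age, tier_for_age(age))
```

S3 applies lifecycle actions asynchronously, so an object may stay in its previous tier for a short time after crossing a boundary.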
How to tell it's broken
- heroku drains --app raxx-api-prod returns an empty list, or no drains are shown
- Expected S3 objects for today are absent after 10+ minutes of app activity:

      aws s3 ls s3://raxx-heroku-logs-prod/logs/raxx-api-prod/$(date -u +%Y/%m/%d)/ \
        --region us-east-1

- Lambda error rate is elevated (check CloudWatch Logs: /aws/lambda/raxx-heroku-log-drain-receiver)
- Heroku shows H14 (no web dynos) but logs aren't reaching S3
How to diagnose (in order)
1. Check Heroku drain registration for all three apps:

   ```bash
   heroku drains --app raxx-api-prod --json
   heroku drains --app raxx-console-prod --json
   heroku drains --app raxx-velvet-prod --json
   ```

   Expected: each returns at least one drain entry with a URL matching the Lambda Function URL domain (*.lambda-url.us-east-1.on.aws).

2. Check S3 for recent objects (last 10 minutes' worth):

   ```bash
   DATE=$(date -u +%Y/%m/%d)
   aws s3 ls s3://raxx-heroku-logs-prod/logs/raxx-api-prod/${DATE}/ \
     --region us-east-1 | tail -5
   ```

   Expected: at least a few objects if the app has had any traffic.

3. Check Lambda logs for errors:

   ```bash
   aws logs tail /aws/lambda/raxx-heroku-log-drain-receiver \
     --since 30m --region us-east-1
   ```

   Look for drain auth failed (wrong token), drain write failed (S3 issue), or Python tracebacks.

4. Verify the drain URL is still valid in SSM:

   ```bash
   # Check that the SSM parameter exists (don't print the value)
   aws ssm get-parameter --name /raxx/log-drain/drain_url \
     --region us-east-1 --query Parameter.Name --output text
   ```

   Expected: /raxx/log-drain/drain_url

5. Test the Lambda endpoint directly (the URL is read from SSM):

   ```bash
   DRAIN_URL=$(aws ssm get-parameter \
     --name /raxx/log-drain/drain_url \
     --with-decryption \
     --query Parameter.Value \
     --output text \
     --region us-east-1)
   # Send a synthetic syslog batch
   curl -s -o /dev/null -w "%{http_code}" \
     -X POST "$DRAIN_URL" \
     -H "Content-Type: application/logplex-1" \
     --data-binary "83 <158>1 $(date -u +%Y-%m-%dT%H:%M:%S.000000+00:00) host raxx-api-prod web.1 - synthetic test line"
   ```

   Expected: 200
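The synthetic batch in the last step uses Logplex's octet-counted framing, where each message is prefixed with its byte length and a space. The hard-coded 83 in the curl example may not equal the real length of the message that follows it; whether the receiver validates that prefix is not stated in this runbook. A minimal sketch that computes the prefix instead (the function name `logplex_frame` is hypothetical):

```python
from datetime import datetime, timezone


def logplex_frame(app: str, proc: str, text: str, now: datetime) -> bytes:
    """Build one octet-counted syslog frame: b'<length> <message>'."""
    ts = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f+00:00")
    msg = f"<158>1 {ts} host {app} {proc} - {text}".encode()
    # The prefix is the byte length of the message that follows the space.
    return str(len(msg)).encode() + b" " + msg


frame = logplex_frame("raxx-api-prod", "web.1", "synthetic test line",
                      datetime(2026, 6, 19, tzinfo=timezone.utc))
print(frame.decode())
```

The resulting bytes can be passed to curl with --data-binary in place of the hand-written string.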
Known failure modes
Failure mode A: No drains registered (fresh install / drain removed)
Symptom: heroku drains --app raxx-api-prod returns []. No S3 objects.
Cause: Terraform was applied but wire-heroku-log-drain.sh was not run;
or a drain was manually removed.
Fix (idempotent, re-registers drains):
export AWS_ACCESS_KEY_ID=<your-key>
export AWS_SECRET_ACCESS_KEY=<your-secret>
export HEROKU_API_KEY=<your-key>
bash scripts/ops/wire-heroku-log-drain.sh
Verification: heroku drains --app raxx-api-prod returns a drain URL;
S3 objects appear within 5 minutes.
Failure mode B: Lambda returns 403 (wrong drain token)
Symptom: curl -s -o /dev/null -w "%{http_code}" "$DRAIN_URL" returns 403.
Lambda CloudWatch Logs show drain auth failed: token_present=True.
Cause: The DRAIN_TOKEN environment variable on the Lambda does not match
the ?token= value in the drain URL — this can happen if the token was
regenerated (terraform taint + apply) without re-registering the Heroku drains.
Fix:
# 1. Re-apply Terraform (regenerates token, updates Lambda env var + SSM)
cd terraform/log-drain
terraform taint random_password.drain_token
terraform apply
# 2. Re-register Heroku drains with new URL
bash scripts/ops/wire-heroku-log-drain.sh
# 3. Remove the old drain (different URL) from each app
# Get current drain IDs first:
heroku drains --app raxx-api-prod --json \
| python3 -c "import json,sys; [print(d['id'], d['url'][:60]) for d in json.load(sys.stdin)]"
# Remove old drain by ID:
heroku drains:remove <old-drain-id> --app raxx-api-prod
Verification: curl returns 200; S3 objects appear within 5 minutes.
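When the old and new drains point at the same Function URL host and differ only in the ?token= value, picking the old one by eye from truncated URLs is error-prone. One way to select it is to compare each drain's full URL with the current drain URL from SSM. The sketch below assumes the JSON shape used by the one-liner above (a list of objects with id and url fields); the function name and sample URLs are hypothetical.

```python
import json


def stale_drain_ids(drains_json: str, current_url: str) -> list[str]:
    """Return IDs of drains whose URL differs from the current drain URL."""
    return [d["id"] for d in json.loads(drains_json) if d.get("url") != current_url]


sample = json.dumps([
    {"id": "d-old", "url": "https://example.lambda-url.us-east-1.on.aws/?token=old"},
    {"id": "d-new", "url": "https://example.lambda-url.us-east-1.on.aws/?token=new"},
])
print(stale_drain_ids(sample, "https://example.lambda-url.us-east-1.on.aws/?token=new"))
```

Note that this treats every non-matching drain as stale, so it is only safe as written if the log drain is the sole drain on the app.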
Failure mode C: Lambda cannot write to S3 (IAM)
Symptom: Lambda CloudWatch Logs show drain write failed with
AccessDenied or NoSuchBucket.
Cause: IAM role policy for raxx-heroku-log-drain-receiver is missing or
the bucket was accidentally deleted/renamed.
Diagnosis:
# Confirm Lambda role policy via AWS console or:
aws iam get-role-policy \
--role-name raxx-heroku-log-drain-receiver \
--policy-name raxx-heroku-log-drain-receiver-policy \
--region us-east-1
# Confirm bucket exists:
aws s3 ls s3://raxx-heroku-logs-prod/ --region us-east-1
Fix: Re-apply terraform/log-drain to restore the IAM policy or bucket.
If the bucket was deleted, it must be recreated (lifecycle data is lost — a
Terraform prevent_destroy = true guard should prevent accidental deletion).
Failure mode D: Lambda cold-start timeouts (rare)
Symptom: Lambda CloudWatch Logs show Task timed out after 15.00 seconds.
Heroku may retry the drain batch.
Cause: Extremely large log batch or S3 latency spike. Lambda timeout is 15 seconds (Heroku drain timeout is 10 seconds).
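Fix: Isolated timeouts need no action. Any redelivery comes from Heroku, since Lambda does not retry synchronous Function URL invocations on its own. If timeouts are sustained, increase Lambda
memory_size (more memory = more CPU = faster execution) in
terraform/log-drain/variables.tf and re-apply.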
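The log signatures quoted in failure modes B through D can be used to triage output from `aws logs tail`. A minimal sketch, assuming those substrings appear verbatim in the Lambda logs (the mapping and function name are illustrative; failure mode A produces no Lambda log lines because nothing is invoked):

```python
from typing import Optional

# Substrings taken from the symptoms listed above, mapped to failure modes.
SIGNATURES = {
    "drain auth failed": "B: wrong drain token",
    "drain write failed": "C: Lambda cannot write to S3",
    "Task timed out": "D: Lambda timeout",
}


def classify(line: str) -> Optional[str]:
    """Return the matching failure mode for a log line, or None."""
    for needle, mode in SIGNATURES.items():
        if needle in line:
            return mode
    return None


print(classify("drain auth failed: token_present=True"))
```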
Provisioning (first-time / recovery)
Prerequisites
- AWS CLI with credentials that have admin access (operator workstation or
  Kristerpher's personal AWS credentials, NOT claude-infisical-bootstrap, which has narrow Infisical-bootstrap permissions only)
- Heroku CLI authenticated with HEROKU_API_KEY for prod apps
- Terraform >= 1.5.0
Step 1: Apply Terraform
cd terraform/log-drain
terraform init
terraform plan -out=tfplan
# Review: expect ~14 new resources (S3 bucket + versioning + encryption +
# public-access-block + lifecycle, IAM role + policy, Lambda + function URL,
# SSM params x4, random_password x1)
terraform apply tfplan
Note: The claude-infisical-bootstrap IAM user cannot apply this stack
(it lacks s3:CreateBucket, iam:CreateRole, lambda:CreateFunction).
Kristerpher must apply from his workstation using his AWS admin credentials.
Step 2: Register Heroku drains
export HEROKU_API_KEY=<prod-api-key-from-vault>
bash scripts/ops/wire-heroku-log-drain.sh
Step 3: Verify
# Drain registration
heroku drains --app raxx-api-prod
heroku drains --app raxx-console-prod
heroku drains --app raxx-velvet-prod
# Wait 5 minutes, then check for S3 objects
aws s3 ls s3://raxx-heroku-logs-prod/logs/ --region us-east-1 --recursive | head -10
Step 4: Smoke-test the Lambda
DRAIN_URL=$(aws ssm get-parameter \
--name /raxx/log-drain/drain_url \
--with-decryption \
--query Parameter.Value \
--output text \
--region us-east-1)
curl -s -o /dev/null -w "HTTP %{http_code}\n" \
-X POST "$DRAIN_URL" \
-H "Content-Type: application/logplex-1" \
--data-binary "83 <158>1 $(date -u +%Y-%m-%dT%H:%M:%S.000000+00:00) host raxx-api-prod web.1 - smoke test from runbook"
# Expected: HTTP 200
# Then verify the object landed in S3
aws s3 ls s3://raxx-heroku-logs-prod/logs/raxx-api-prod/$(date -u +%Y/%m/%d)/ \
--region us-east-1 | tail -3
Retrieving logs from S3
S3 objects are gzip-compressed, one per Heroku logplex batch (typically 10–100 log lines per object). To read them:
# List objects for a date
aws s3 ls s3://raxx-heroku-logs-prod/logs/raxx-api-prod/2026/06/19/ \
--region us-east-1
# Download and decompress a single object
aws s3 cp s3://raxx-heroku-logs-prod/logs/raxx-api-prod/2026/06/19/<key>.gz - \
--region us-east-1 | gunzip | head -50
# Search across a day's logs with grep
for key in $(aws s3 ls s3://raxx-heroku-logs-prod/logs/raxx-api-prod/2026/06/19/ \
--region us-east-1 | awk '{print $4}'); do
aws s3 cp "s3://raxx-heroku-logs-prod/logs/raxx-api-prod/2026/06/19/$key" - \
--region us-east-1 | gunzip | grep "status=5" || true
done
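The same decompress-and-filter step can be done in Python once an object's bytes are in hand, for example after downloading it with the AWS CLI or an SDK. A minimal sketch (the function name `grep_gzip` is hypothetical, and fetching the object is left out):

```python
import gzip


def grep_gzip(blob: bytes, needle: str) -> list[str]:
    """Decompress one gzip object and return the lines containing needle."""
    text = gzip.decompress(blob).decode("utf-8", errors="replace")
    return [line for line in text.splitlines() if needle in line]


# Stand-in for one downloaded S3 object.
blob = gzip.compress(b"at=info status=200\nat=error status=503\n")
print(grep_gzip(blob, "status=5"))
```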
Future query layer: Splunk will be configured to ingest from S3 directly (#future). Until then, use the direct S3 grep pattern above or the AWS console S3 Select feature for ad-hoc searches.
Token rotation
If the drain token must be rotated (suspected leak):
cd terraform/log-drain
# 1. Force Terraform to regenerate the random token
terraform taint random_password.drain_token
# 2. Apply — this updates the Lambda env var + SSM params atomically
terraform apply -target=random_password.drain_token \
-target=aws_lambda_function.drain_receiver \
-target=aws_ssm_parameter.drain_token \
-target=aws_ssm_parameter.drain_url
# 3. Re-register drains with the new URL
bash scripts/ops/wire-heroku-log-drain.sh
# 4. Remove stale drain registrations. Old and new drains normally share the
#    same Lambda Function URL host, so compare against the current full URL.
NEW_URL=$(aws ssm get-parameter \
  --name /raxx/log-drain/drain_url \
  --with-decryption \
  --query Parameter.Value \
  --output text \
  --region us-east-1)
for APP in raxx-api-prod raxx-console-prod raxx-velvet-prod; do
  OLD_IDS=$(heroku drains --app "$APP" --json \
    | NEW_URL="$NEW_URL" python3 -c "
import json, os, sys
for d in json.load(sys.stdin):
    url = d.get('url', '')
    # Only touch Lambda Function URL drains that are not the current one
    if 'lambda-url' in url and url != os.environ['NEW_URL']:
        print(d['id'])
")
  for ID in $OLD_IDS; do
    echo "Removing old drain $ID from $APP"
    heroku drains:remove "$ID" --app "$APP"
  done
done
Emergency stop — disable all drains
To stop Heroku from sending logs to the endpoint (e.g., endpoint misbehaving):
for APP in raxx-api-prod raxx-console-prod raxx-velvet-prod; do
IDS=$(heroku drains --app "$APP" --json \
| python3 -c "import json,sys; [print(d['id']) for d in json.load(sys.stdin)]")
for ID in $IDS; do
echo "Removing drain $ID from $APP"
heroku drains:remove "$ID" --app "$APP"
done
done
To re-register:
bash scripts/ops/wire-heroku-log-drain.sh
Cost estimate
| Resource | Volume assumption | $/month |
|---|---|---|
| S3 Standard storage (0–90 days) | ~100 MB/day × 90 = 9 GB | ~$0.21 |
| S3-IA storage (90–365 days) | ~3 GB entering/month × 9 mo avg | ~$0.11 |
| S3 PUT requests (drain writes) | ~2K batches/day × 30 = 60K at $0.005 per 1,000 | ~$0.30 |
| Lambda invocations | 2K/day × 30 | $0.00 (free tier) |
| Lambda duration | 128 MB × 0.5s avg × 60K/mo | $0.00 (free tier) |
| CloudWatch Logs | ~10 MB/mo Lambda logs | ~$0.01 |
| Total estimated | pre-launch scale | ~$0.63/month |
Volume basis: pre-launch. At production scale (DAU in the thousands), log volume grows proportionally, but the total should remain well under $10/mo unless log throughput is sustained at 1 GB/day.
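The S3 Standard storage row can be reproduced, and scaled to other volumes, with a one-line calculation. This is a rough sketch: the $0.023 per GB-month price is an assumed us-east-1 list price, decimal gigabytes are used to match the table, and the function name is hypothetical.

```python
def standard_storage_cost(daily_mb: float, days: int = 90,
                          usd_per_gb_month: float = 0.023) -> float:
    """Steady-state monthly cost of the S3 Standard window (decimal GB)."""
    stored_gb = daily_mb * days / 1000
    return stored_gb * usd_per_gb_month


print(round(standard_storage_cost(100), 2))   # pre-launch volume from the table
print(round(standard_storage_cost(1000), 2))  # 1 GB/day
```

Storage in the Standard window scales linearly with daily volume, so the same function gives a first-order figure for any throughput.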
Scope boundary
This stack intentionally covers S3 retention only.
NOT in scope for this stack:
- Athena / Glue (query layer): deferred to future Splunk setup (#future)
- Log-based alerting: see #3448 (paging/alerting vendor)
- Cloudflare Pages log forwarding: separate follow-up card under #3438
- Log format normalization / structured logging: application-level concern
Escalation
Escalate to Kristerpher (ops@raxx.app) when:
- S3 objects are absent for more than 2 hours and Heroku apps are confirmed active
- Lambda error rate exceeds 1% sustained for 30+ minutes
- S3 bucket shows unexpected object deletion or is inaccessible
- A cost alert fires for raxx-heroku-logs-prod exceeding $20/mo