Heroku log drain → S3 + Athena runbook
System: Heroku log drain → S3 (raxx-heroku-logs-prod) + AWS Athena
Owner: sre-agent / operator
Last incident: n/a (new stack, 2026-06-17)
Last reviewed: 2026-06-17
Issue: #3450
Related: terraform/log-drain/, scripts/ops/wire-heroku-log-drain.sh
Overview
All Heroku application log output (stdout + stderr) from raxx-api-prod,
raxx-console-prod, and raxx-velvet-prod drains to the S3 bucket
raxx-heroku-logs-prod via Heroku's HTTPS log drain feature.
Logs are stored in RFC-5424 syslog format, compressed (gzip), and partitioned
by app/year/month/day. AWS Athena queries the bucket via a Glue catalog
table. There is no real-time log UI — all historical search is ad-hoc via
Athena (or direct S3 access for single-object inspection).
Key resources
| Resource | Identifier |
|---|---|
| S3 bucket | raxx-heroku-logs-prod (us-east-1) |
| Glue database | raxx_heroku_logs |
| Glue table | syslog |
| Athena workgroup | raxx-log-queries |
| IAM role (read) | raxx-log-athena-reader |
| IAM user (write) | raxx-heroku-log-drain |
| SSM path prefix | /raxx/log-drain/ |
| Heroku apps drained | raxx-api-prod, raxx-console-prod, raxx-velvet-prod |
Lifecycle
| Age | Tier | Notes |
|---|---|---|
| 0–90 days | S3 Standard | Immediately queryable by Athena |
| 90–365 days | S3 Glacier Flexible Retrieval | Requires restore before Athena query |
| > 365 days | Deleted | No further retention |
Athena result objects in the athena-results/ prefix expire after 30 days.
How to tell the drain is broken
heroku drains --app raxx-api-prodreturns an empty list or drain showshealthy: false- Expected S3 objects for today are absent:
bash aws s3 ls s3://raxx-heroku-logs-prod/logs/raxx-api-prod/$(date -u +%Y)/$(date -u +%m)/$(date -u +%d)/ \ --region us-east-1Expected: at least a few objects if the app has had any traffic. - Athena queries return 0 rows for today even though the app has been active.
- Heroku shows H14 (no web dynos) but logs aren't reaching S3.
How to diagnose (in order)
-
Check Heroku drain registration for all three apps:
bash heroku drains --app raxx-api-prod --json heroku drains --app raxx-console-prod --json heroku drains --app raxx-velvet-prod --jsonExpected: each returns at least one drain with the S3 endpoint URL. -
Check S3 for recent objects (last hour):
bash DATE=$(date -u +%Y/%m/%d) aws s3 ls s3://raxx-heroku-logs-prod/logs/raxx-api-prod/${DATE}/ \ --region us-east-1 | tail -5 -
Check drain token is still valid in SSM:
bash aws ssm get-parameter --name /raxx/log-drain/drain_url \ --with-decryption --query Parameter.Value --output text --region us-east-1The URL should match whatheroku drainsshows. -
Check IAM write permissions for the drain writer user:
bash aws iam simulate-principal-policy \ --policy-source-arn $(aws ssm get-parameter \ --name /raxx/log-drain/bucket_arn \ --query Parameter.Value --output text --region us-east-1 \ | sed 's|arn:aws:s3:::.*|arn:aws:iam::<account-id>:user/raxx-heroku-log-drain|') \ --action-names s3:PutObject \ --resource-arns "arn:aws:s3:::raxx-heroku-logs-prod/logs/*"
Athena query patterns
All queries use the Athena workgroup raxx-log-queries. Results land in
s3://raxx-heroku-logs-prod/athena-results/.
Assume the Athena reader role before running queries if authenticating from CLI:
ROLE_ARN=$(aws ssm get-parameter \
--name /raxx/log-drain/athena_reader_role_arn \
--query Parameter.Value --output text --region us-east-1)
CREDS=$(aws sts assume-role \
--role-arn "$ROLE_ARN" \
--role-session-name "log-query-$(date +%Y%m%dT%H%M%S)" \
--output json)
export AWS_ACCESS_KEY_ID=$(echo "$CREDS" | python3 -c "import json,sys; print(json.load(sys.stdin)['Credentials']['AccessKeyId'])")
export AWS_SECRET_ACCESS_KEY=$(echo "$CREDS" | python3 -c "import json,sys; print(json.load(sys.stdin)['Credentials']['SecretAccessKey'])")
export AWS_SESSION_TOKEN=$(echo "$CREDS" | python3 -c "import json,sys; print(json.load(sys.stdin)['Credentials']['SessionToken'])")
Run queries:
WORKGROUP="raxx-log-queries"
DB="raxx_heroku_logs"
aws athena start-query-execution \
--query-string "SELECT ..." \
--work-group "$WORKGROUP" \
--query-execution-context Database="$DB" \
--region us-east-1
Query 1 — Errors by surface (HTTP 5xx per app per hour, last 24h)
Count HTTP 5xx responses per app per hour window. Useful for spotting post-deploy error spikes or correlated multi-surface failures.
-- Query 1: HTTP 5xx count by app and hour, last 24 hours
-- Partition pruning: last 2 days (covers full 24h UTC window regardless of local time)
SELECT
app,
date_trunc('hour', from_iso8601_timestamp(log_ts)) AS hour_utc,
COUNT(*) AS error_count
FROM raxx_heroku_logs.syslog
WHERE
year IN ('2026')
AND month IN (
lpad(CAST(month(current_date) AS varchar), 2, '0'),
lpad(CAST(month(current_date - interval '1' day) AS varchar), 2, '0')
)
AND day IN (
lpad(CAST(day(current_date) AS varchar), 2, '0'),
lpad(CAST(day(current_date - interval '1' day) AS varchar), 2, '0')
)
AND message LIKE '%status=5%'
AND from_iso8601_timestamp(log_ts) >= now() - interval '24' hour
GROUP BY
1, 2
ORDER BY
hour_utc DESC,
error_count DESC;
Expected columns: app, hour_utc, error_count
Tip: Add AND app = 'raxx-api-prod' to narrow to a single surface.
Query 2 — Latency outliers (requests > 2000ms, last 7 days)
Identify slow requests across all apps. Heroku router log lines include
service=<N>ms for the upstream response time.
-- Query 2: Requests with service time > 2000ms, last 7 days
-- Extracts the service time value from the Heroku router log line
SELECT
app,
proc_id,
log_ts,
regexp_extract(message, 'method=(\S+)', 1) AS method,
regexp_extract(message, 'path=(\S+)', 1) AS path,
CAST(regexp_extract(message, 'service=(\d+)ms', 1) AS bigint) AS service_ms,
regexp_extract(message, 'status=(\d+)', 1) AS status,
regexp_extract(message, 'fwd="([^"]+)"', 1) AS fwd_ip
FROM raxx_heroku_logs.syslog
WHERE
year = '2026'
AND month IN (
lpad(CAST(month(current_date) AS varchar), 2, '0'),
lpad(CAST(month(current_date - interval '6' day) AS varchar), 2, '0')
)
AND from_iso8601_timestamp(log_ts) >= now() - interval '7' day
AND message LIKE '%service=%ms%'
AND CAST(regexp_extract(message, 'service=(\d+)ms', 1) AS bigint) > 2000
ORDER BY
service_ms DESC
LIMIT 200;
Expected columns: app, proc_id, log_ts, method, path,
service_ms, status, fwd_ip
Note: Heroku router lines include at=info or at=error. Filter
message LIKE '%at=error%' to surface only error-state slow requests.
Query 3 — Security-relevant lines (auth failures, rate limits, IP anomalies, last 24h)
Pull authentication failures, rate-limit hits, and suspicious patterns that indicate an active attack or misconfigured client.
-- Query 3: Security-relevant log lines, last 24 hours
-- Covers: 401/403/429 responses, passkey failure events, rate-limit triggers
SELECT
app,
proc_id,
log_ts,
CASE
WHEN message LIKE '%status=401%' THEN '401-unauthorized'
WHEN message LIKE '%status=403%' THEN '403-forbidden'
WHEN message LIKE '%status=429%' THEN '429-rate-limited'
WHEN message LIKE '%webauthn.registration_verify_failed%' THEN 'passkey-register-fail'
WHEN message LIKE '%webauthn.authentication_verify_failed%' THEN 'passkey-authn-fail'
WHEN message LIKE '%rate_limit_hit%' THEN 'rate-limit-custom'
ELSE 'other-security'
END AS event_type,
regexp_extract(message, 'fwd="([^"]+)"', 1) AS fwd_ip,
regexp_extract(message, 'path=(\S+)', 1) AS path,
message
FROM raxx_heroku_logs.syslog
WHERE
year IN ('2026')
AND month IN (
lpad(CAST(month(current_date) AS varchar), 2, '0'),
lpad(CAST(month(current_date - interval '1' day) AS varchar), 2, '0')
)
AND from_iso8601_timestamp(log_ts) >= now() - interval '24' hour
AND (
message LIKE '%status=401%'
OR message LIKE '%status=403%'
OR message LIKE '%status=429%'
OR message LIKE '%webauthn.%_failed%'
OR message LIKE '%rate_limit_hit%'
)
ORDER BY
log_ts DESC
LIMIT 500;
Expected columns: app, proc_id, log_ts, event_type, fwd_ip,
path, message
IP anomaly pivot: To count events by IP:
SELECT fwd_ip, COUNT(*) AS hits
FROM ( <above query> ) t
GROUP BY 1 ORDER BY 2 DESC LIMIT 20;
Query 4 — Recent deploys (Heroku release log lines, last 7 days)
Correlate Heroku at=info Logplex "deploy" events with timestamps. Useful for
establishing a deploy timeline when triaging a post-deploy spike.
-- Query 4: Deploy events from Heroku Logplex, last 7 days
-- Heroku emits system log lines with proc_id 'heroku/release' or 'heroku/router' at deploy time.
-- Release lines contain "Deploy <slug-id>" in the message.
SELECT
app,
log_ts,
proc_id,
message
FROM raxx_heroku_logs.syslog
WHERE
year = '2026'
AND month IN (
lpad(CAST(month(current_date) AS varchar), 2, '0'),
lpad(CAST(month(current_date - interval '6' day) AS varchar), 2, '0')
)
AND from_iso8601_timestamp(log_ts) >= now() - interval '7' day
AND (
(proc_id LIKE 'heroku/%' AND message LIKE '%Deploy%')
OR message LIKE '%State changed from%'
OR message LIKE '%Deploying%'
OR message LIKE '%release v%'
)
ORDER BY
log_ts DESC
LIMIT 100;
Expected columns: app, log_ts, proc_id, message
Cross-reference: To correlate a deploy with a Sentry error spike:
1. Note the log_ts of the deploy line.
2. Run Query 1 with a narrower window (e.g., >= <deploy_ts> and
<= <deploy_ts> + interval '1' hour).
3. Compare the per-hour error count before and after the deploy timestamp.
Query 5 — Slow requests by endpoint (avg latency per path prefix, last 7 days)
Group by path prefix to identify which endpoints contribute most to latency. Useful for capacity planning and prioritising performance work.
-- Query 5: Average and P95 service time by path prefix, last 7 days
-- Path prefix: first two segments (e.g., /api/auth, /api/trades, /api/beta)
SELECT
app,
regexp_replace(
regexp_extract(message, 'path=(\S+)', 1),
'^(/[^/?#]*/[^/?#]*).*$', '\1'
) AS path_prefix,
COUNT(*) AS request_count,
ROUND(AVG(
CAST(regexp_extract(message, 'service=(\d+)ms', 1) AS double)
), 1) AS avg_service_ms,
ROUND(approx_percentile(
CAST(regexp_extract(message, 'service=(\d+)ms', 1) AS double), 0.95
), 1) AS p95_service_ms,
MAX(
CAST(regexp_extract(message, 'service=(\d+)ms', 1) AS bigint)
) AS max_service_ms
FROM raxx_heroku_logs.syslog
WHERE
year = '2026'
AND month IN (
lpad(CAST(month(current_date) AS varchar), 2, '0'),
lpad(CAST(month(current_date - interval '6' day) AS varchar), 2, '0')
)
AND from_iso8601_timestamp(log_ts) >= now() - interval '7' day
AND message LIKE '%service=%ms%'
AND message LIKE '%method=%'
AND regexp_extract(message, 'service=(\d+)ms', 1) != ''
GROUP BY
1, 2
HAVING
COUNT(*) >= 10 -- suppress one-off paths
ORDER BY
avg_service_ms DESC
LIMIT 50;
Expected columns: app, path_prefix, request_count, avg_service_ms,
p95_service_ms, max_service_ms
Tip: Add AND app = 'raxx-api-prod' to narrow to the Raptor API surface.
Apply runbook (first-time provisioning)
-
Ensure Terraform state backend is accessible:
bash aws s3 ls s3://raxx-iac-state-prod/ --region us-east-1 -
Resolve the ops break-glass ARN:
bash export TF_VAR_ops_break_glass_arn=$(aws iam get-user \ --user-name claude-infisical-bootstrap \ --query User.Arn --output text) -
Init and apply:
bash cd terraform/log-drain terraform init terraform plan -out=tfplan # Review plan: expect ~15 new resources (S3, lifecycle, IAM, SSM, Athena) terraform apply tfplan -
Deploy the drain receiver endpoint (see Bootstrap section below).
-
Wire Heroku drains and Glue catalog:
bash bash scripts/ops/wire-heroku-log-drain.sh -
Verify:
bash heroku drains --app raxx-api-prod heroku drains --app raxx-console-prod heroku drains --app raxx-velvet-prod
Bootstrap: drain receiver endpoint
Heroku HTTPS log drains POST RFC-5424 syslog batches to an HTTPS URL that must return HTTP 200. The simplest production-ready approach is an API Gateway + Lambda function.
Lambda receiver (recommended)
A minimal receiver Lambda at lambdas/log-drain-receiver/ (to be created as a
follow-up under #3450):
- Receives
POST /drainwith aContent-Type: application/logplex-1body. - Splits the batch into individual log lines.
- Writes a gzip-compressed object to S3 at:
logs/<app_name>/<YYYY>/<MM>/<DD>/<epoch_ms>-<uuid>.gz - Returns
200 OK.
Until the Lambda is deployed, wire-heroku-log-drain.sh will halt with
instructions. Manual bootstrap alternative:
# Set drain_url SSM parameter manually after deploying the Lambda:
aws ssm put-parameter \
--name /raxx/log-drain/drain_url \
--type SecureString \
--value "https://<api-id>.execute-api.us-east-1.amazonaws.com/prod/drain?token=<drain-token>" \
--region us-east-1
# Then re-run:
bash scripts/ops/wire-heroku-log-drain.sh
The drain URL includes a bearer token (token= query param) that matches a
value the Lambda validates before writing to S3. This prevents the endpoint
from being abused as an open S3 write proxy.
Known limitations
Unstructured log lines
Raptor logs some lines that are not in the standard Heroku router format
(e.g., raw Python tracebacks, multi-line exception output). The regex SerDe
places the entire unstructured line in the message column and leaves other
columns as empty strings. Athena queries over non-structured lines still work
for substring matching (message LIKE '%...') but structured field extraction
(regexp_extract) will return empty strings.
To identify the proportion of unstructured lines:
SELECT
COUNT(*) AS total,
SUM(CASE WHEN method IS NULL OR method = '' THEN 1 ELSE 0 END) AS unstructured_count
FROM raxx_heroku_logs.syslog
WHERE year = '2026' AND month = '06' AND day = '17'
LIMIT 1;
Glacier query cost
Objects older than 90 days are in S3 Glacier Flexible Retrieval. Athena cannot query them directly. To query archived logs:
# Restore objects for the desired date range (expedited = 1–5 min, Tier Standard = 3–5h)
aws s3api restore-object \
--bucket raxx-heroku-logs-prod \
--key "logs/raxx-api-prod/2026/03/01/<object-key>.gz" \
--restore-request '{"Days":2,"GlacierJobParameters":{"Tier":"Standard"}}'
Wait for restoration, then re-run Athena queries. Restored objects revert to
Glacier after the specified Days.
Partition projection — no MSCK REPAIR TABLE needed
The Glue table uses Athena Partition Projection. New partitions (new dates, new
apps) are inferred automatically from the S3 key structure. No MSCK REPAIR
TABLE or manual partition registration is needed.
If a new Heroku app must be added to the drain, update the projection.app.values
parameter in the Glue table definition:
aws glue update-table \
--database-name raxx_heroku_logs \
--table-input '{ ... "Parameters": { "projection.app.values": "raxx-api-prod,raxx-console-prod,raxx-velvet-prod,<new-app>" ... } }' \
--region us-east-1
And register the drain on the new app:
DRAIN_URL=$(aws ssm get-parameter --name /raxx/log-drain/drain_url \
--with-decryption --query Parameter.Value --output text --region us-east-1)
heroku drains:add "$DRAIN_URL" --app <new-app>
Emergency stop
To stop all log drains (e.g., if the drain endpoint is misbehaving and flooding S3 with bad data):
# List drain IDs
heroku drains --app raxx-api-prod --json \
| python3 -c "import json,sys; [print(d['id'], d['url'][:50]) for d in json.load(sys.stdin)]"
# Remove by drain ID (not URL)
heroku drains:remove <drain-id> --app raxx-api-prod
heroku drains:remove <drain-id> --app raxx-console-prod
heroku drains:remove <drain-id> --app raxx-velvet-prod
To re-register:
bash scripts/ops/wire-heroku-log-drain.sh
Cost estimate
| Resource | Volume assumption | $/month |
|---|---|---|
| S3 Standard storage (0–90 days) | ~500 MB/day × 90 days = 45 GB | ~$1.04 |
| S3 Glacier storage (90–365 days) | ~15 GB entering/month × 9 months avg | ~$0.54 |
| S3 PUT requests (drain writes) | ~10K objects/day | ~$0.05 |
| S3 GET requests (Athena reads) | ~50K scans/month | ~$0.05 |
| Athena scan cost | ~5 GB scanned/month | ~$0.12 |
| Athena result storage (30-day TTL) | ~10 MB | ~$0.00 |
| Glue metadata catalog | Partition projection (no crawler) | $0.00 |
| IAM + SSM | n/a | $0.00 |
| Total estimated | ~$1.80/month |
Volume basis: pre-launch traffic. At production scale (thousands of daily active users), log volume grows proportionally but remains well under $50/mo unless sustained at >5 GB/day log throughput. Lifecycle to Glacier caps long-term storage cost.
Escalation
Escalate to Kristerpher (ops@raxx.app) when: - Drain objects are absent from S3 for more than 2 hours and the Heroku app is confirmed active (dynos up, traffic flowing). - Athena queries return persistent errors unrelated to partition scope. - The S3 bucket is unexpectedly empty or objects are missing (possible exfiltration). - A cost alert fires for the S3 bucket exceeding $20/mo (indicates unexpected volume or a runaway write loop).