Raxx · internal docs

internal · gated

Heroku log drain → S3 + Athena runbook

System: Heroku log drain → S3 (raxx-heroku-logs-prod) + AWS Athena Owner: sre-agent / operator Last incident: n/a (new stack, 2026-06-17) Last reviewed: 2026-06-17 Issue: #3450 Related: terraform/log-drain/, scripts/ops/wire-heroku-log-drain.sh


Overview

All Heroku application log output (stdout + stderr) from raxx-api-prod, raxx-console-prod, and raxx-velvet-prod drains to the S3 bucket raxx-heroku-logs-prod via Heroku's HTTPS log drain feature.

Logs are stored in RFC-5424 syslog format, compressed (gzip), and partitioned by app/year/month/day. AWS Athena queries the bucket via a Glue catalog table. There is no real-time log UI — all historical search is ad-hoc via Athena (or direct S3 access for single-object inspection).

Key resources

Resource Identifier
S3 bucket raxx-heroku-logs-prod (us-east-1)
Glue database raxx_heroku_logs
Glue table syslog
Athena workgroup raxx-log-queries
IAM role (read) raxx-log-athena-reader
IAM user (write) raxx-heroku-log-drain
SSM path prefix /raxx/log-drain/
Heroku apps drained raxx-api-prod, raxx-console-prod, raxx-velvet-prod

Lifecycle

Age Tier Notes
0–90 days S3 Standard Immediately queryable by Athena
90–365 days S3 Glacier Flexible Retrieval Requires restore before Athena query
> 365 days Deleted No further retention

Athena result objects in the athena-results/ prefix expire after 30 days.


How to tell the drain is broken


How to diagnose (in order)

  1. Check Heroku drain registration for all three apps: bash heroku drains --app raxx-api-prod --json heroku drains --app raxx-console-prod --json heroku drains --app raxx-velvet-prod --json Expected: each returns at least one drain with the S3 endpoint URL.

  2. Check S3 for recent objects (last hour): bash DATE=$(date -u +%Y/%m/%d) aws s3 ls s3://raxx-heroku-logs-prod/logs/raxx-api-prod/${DATE}/ \ --region us-east-1 | tail -5

  3. Check drain token is still valid in SSM: bash aws ssm get-parameter --name /raxx/log-drain/drain_url \ --with-decryption --query Parameter.Value --output text --region us-east-1 The URL should match what heroku drains shows.

  4. Check IAM write permissions for the drain writer user: bash aws iam simulate-principal-policy \ --policy-source-arn $(aws ssm get-parameter \ --name /raxx/log-drain/bucket_arn \ --query Parameter.Value --output text --region us-east-1 \ | sed 's|arn:aws:s3:::.*|arn:aws:iam::<account-id>:user/raxx-heroku-log-drain|') \ --action-names s3:PutObject \ --resource-arns "arn:aws:s3:::raxx-heroku-logs-prod/logs/*"


Athena query patterns

All queries use the Athena workgroup raxx-log-queries. Results land in s3://raxx-heroku-logs-prod/athena-results/.

Assume the Athena reader role before running queries if authenticating from CLI:

ROLE_ARN=$(aws ssm get-parameter \
  --name /raxx/log-drain/athena_reader_role_arn \
  --query Parameter.Value --output text --region us-east-1)

CREDS=$(aws sts assume-role \
  --role-arn "$ROLE_ARN" \
  --role-session-name "log-query-$(date +%Y%m%dT%H%M%S)" \
  --output json)

export AWS_ACCESS_KEY_ID=$(echo "$CREDS" | python3 -c "import json,sys; print(json.load(sys.stdin)['Credentials']['AccessKeyId'])")
export AWS_SECRET_ACCESS_KEY=$(echo "$CREDS" | python3 -c "import json,sys; print(json.load(sys.stdin)['Credentials']['SecretAccessKey'])")
export AWS_SESSION_TOKEN=$(echo "$CREDS" | python3 -c "import json,sys; print(json.load(sys.stdin)['Credentials']['SessionToken'])")

Run queries:

WORKGROUP="raxx-log-queries"
DB="raxx_heroku_logs"

aws athena start-query-execution \
  --query-string "SELECT ..." \
  --work-group "$WORKGROUP" \
  --query-execution-context Database="$DB" \
  --region us-east-1

Query 1 — Errors by surface (HTTP 5xx per app per hour, last 24h)

Count HTTP 5xx responses per app per hour window. Useful for spotting post-deploy error spikes or correlated multi-surface failures.

-- Query 1: HTTP 5xx count by app and hour, last 24 hours
-- Partition pruning: last 2 days (covers full 24h UTC window regardless of local time)

SELECT
    app,
    date_trunc('hour', from_iso8601_timestamp(log_ts)) AS hour_utc,
    COUNT(*) AS error_count
FROM raxx_heroku_logs.syslog
WHERE
    year IN ('2026')
    AND month IN (
        lpad(CAST(month(current_date) AS varchar), 2, '0'),
        lpad(CAST(month(current_date - interval '1' day) AS varchar), 2, '0')
    )
    AND day IN (
        lpad(CAST(day(current_date) AS varchar), 2, '0'),
        lpad(CAST(day(current_date - interval '1' day) AS varchar), 2, '0')
    )
    AND message LIKE '%status=5%'
    AND from_iso8601_timestamp(log_ts) >= now() - interval '24' hour
GROUP BY
    1, 2
ORDER BY
    hour_utc DESC,
    error_count DESC;

Expected columns: app, hour_utc, error_count

Tip: Add AND app = 'raxx-api-prod' to narrow to a single surface.


Query 2 — Latency outliers (requests > 2000ms, last 7 days)

Identify slow requests across all apps. Heroku router log lines include service=<N>ms for the upstream response time.

-- Query 2: Requests with service time > 2000ms, last 7 days
-- Extracts the service time value from the Heroku router log line

SELECT
    app,
    proc_id,
    log_ts,
    regexp_extract(message, 'method=(\S+)', 1)                AS method,
    regexp_extract(message, 'path=(\S+)', 1)                  AS path,
    CAST(regexp_extract(message, 'service=(\d+)ms', 1) AS bigint) AS service_ms,
    regexp_extract(message, 'status=(\d+)', 1)                AS status,
    regexp_extract(message, 'fwd="([^"]+)"', 1)               AS fwd_ip
FROM raxx_heroku_logs.syslog
WHERE
    year = '2026'
    AND month IN (
        lpad(CAST(month(current_date) AS varchar), 2, '0'),
        lpad(CAST(month(current_date - interval '6' day) AS varchar), 2, '0')
    )
    AND from_iso8601_timestamp(log_ts) >= now() - interval '7' day
    AND message LIKE '%service=%ms%'
    AND CAST(regexp_extract(message, 'service=(\d+)ms', 1) AS bigint) > 2000
ORDER BY
    service_ms DESC
LIMIT 200;

Expected columns: app, proc_id, log_ts, method, path, service_ms, status, fwd_ip

Note: Heroku router lines include at=info or at=error. Filter message LIKE '%at=error%' to surface only error-state slow requests.


Query 3 — Security-relevant lines (auth failures, rate limits, IP anomalies, last 24h)

Pull authentication failures, rate-limit hits, and suspicious patterns that indicate an active attack or misconfigured client.

-- Query 3: Security-relevant log lines, last 24 hours
-- Covers: 401/403/429 responses, passkey failure events, rate-limit triggers

SELECT
    app,
    proc_id,
    log_ts,
    CASE
        WHEN message LIKE '%status=401%' THEN '401-unauthorized'
        WHEN message LIKE '%status=403%' THEN '403-forbidden'
        WHEN message LIKE '%status=429%' THEN '429-rate-limited'
        WHEN message LIKE '%webauthn.registration_verify_failed%' THEN 'passkey-register-fail'
        WHEN message LIKE '%webauthn.authentication_verify_failed%' THEN 'passkey-authn-fail'
        WHEN message LIKE '%rate_limit_hit%' THEN 'rate-limit-custom'
        ELSE 'other-security'
    END AS event_type,
    regexp_extract(message, 'fwd="([^"]+)"', 1) AS fwd_ip,
    regexp_extract(message, 'path=(\S+)', 1)    AS path,
    message
FROM raxx_heroku_logs.syslog
WHERE
    year IN ('2026')
    AND month IN (
        lpad(CAST(month(current_date) AS varchar), 2, '0'),
        lpad(CAST(month(current_date - interval '1' day) AS varchar), 2, '0')
    )
    AND from_iso8601_timestamp(log_ts) >= now() - interval '24' hour
    AND (
           message LIKE '%status=401%'
        OR message LIKE '%status=403%'
        OR message LIKE '%status=429%'
        OR message LIKE '%webauthn.%_failed%'
        OR message LIKE '%rate_limit_hit%'
    )
ORDER BY
    log_ts DESC
LIMIT 500;

Expected columns: app, proc_id, log_ts, event_type, fwd_ip, path, message

IP anomaly pivot: To count events by IP:

SELECT fwd_ip, COUNT(*) AS hits
FROM ( <above query> ) t
GROUP BY 1 ORDER BY 2 DESC LIMIT 20;

Query 4 — Recent deploys (Heroku release log lines, last 7 days)

Correlate Heroku at=info Logplex "deploy" events with timestamps. Useful for establishing a deploy timeline when triaging a post-deploy spike.

-- Query 4: Deploy events from Heroku Logplex, last 7 days
-- Heroku emits system log lines with proc_id 'heroku/release' or 'heroku/router' at deploy time.
-- Release lines contain "Deploy <slug-id>" in the message.

SELECT
    app,
    log_ts,
    proc_id,
    message
FROM raxx_heroku_logs.syslog
WHERE
    year = '2026'
    AND month IN (
        lpad(CAST(month(current_date) AS varchar), 2, '0'),
        lpad(CAST(month(current_date - interval '6' day) AS varchar), 2, '0')
    )
    AND from_iso8601_timestamp(log_ts) >= now() - interval '7' day
    AND (
           (proc_id LIKE 'heroku/%' AND message LIKE '%Deploy%')
        OR message LIKE '%State changed from%'
        OR message LIKE '%Deploying%'
        OR message LIKE '%release v%'
    )
ORDER BY
    log_ts DESC
LIMIT 100;

Expected columns: app, log_ts, proc_id, message

Cross-reference: To correlate a deploy with a Sentry error spike: 1. Note the log_ts of the deploy line. 2. Run Query 1 with a narrower window (e.g., >= <deploy_ts> and <= <deploy_ts> + interval '1' hour). 3. Compare the per-hour error count before and after the deploy timestamp.


Query 5 — Slow requests by endpoint (avg latency per path prefix, last 7 days)

Group by path prefix to identify which endpoints contribute most to latency. Useful for capacity planning and prioritising performance work.

-- Query 5: Average and P95 service time by path prefix, last 7 days
-- Path prefix: first two segments (e.g., /api/auth, /api/trades, /api/beta)

SELECT
    app,
    regexp_replace(
        regexp_extract(message, 'path=(\S+)', 1),
        '^(/[^/?#]*/[^/?#]*).*$', '\1'
    )                                                              AS path_prefix,
    COUNT(*)                                                        AS request_count,
    ROUND(AVG(
        CAST(regexp_extract(message, 'service=(\d+)ms', 1) AS double)
    ), 1)                                                           AS avg_service_ms,
    ROUND(approx_percentile(
        CAST(regexp_extract(message, 'service=(\d+)ms', 1) AS double), 0.95
    ), 1)                                                           AS p95_service_ms,
    MAX(
        CAST(regexp_extract(message, 'service=(\d+)ms', 1) AS bigint)
    )                                                               AS max_service_ms
FROM raxx_heroku_logs.syslog
WHERE
    year = '2026'
    AND month IN (
        lpad(CAST(month(current_date) AS varchar), 2, '0'),
        lpad(CAST(month(current_date - interval '6' day) AS varchar), 2, '0')
    )
    AND from_iso8601_timestamp(log_ts) >= now() - interval '7' day
    AND message LIKE '%service=%ms%'
    AND message LIKE '%method=%'
    AND regexp_extract(message, 'service=(\d+)ms', 1) != ''
GROUP BY
    1, 2
HAVING
    COUNT(*) >= 10      -- suppress one-off paths
ORDER BY
    avg_service_ms DESC
LIMIT 50;

Expected columns: app, path_prefix, request_count, avg_service_ms, p95_service_ms, max_service_ms

Tip: Add AND app = 'raxx-api-prod' to narrow to the Raptor API surface.


Apply runbook (first-time provisioning)

  1. Ensure Terraform state backend is accessible: bash aws s3 ls s3://raxx-iac-state-prod/ --region us-east-1

  2. Resolve the ops break-glass ARN: bash export TF_VAR_ops_break_glass_arn=$(aws iam get-user \ --user-name claude-infisical-bootstrap \ --query User.Arn --output text)

  3. Init and apply: bash cd terraform/log-drain terraform init terraform plan -out=tfplan # Review plan: expect ~15 new resources (S3, lifecycle, IAM, SSM, Athena) terraform apply tfplan

  4. Deploy the drain receiver endpoint (see Bootstrap section below).

  5. Wire Heroku drains and Glue catalog: bash bash scripts/ops/wire-heroku-log-drain.sh

  6. Verify: bash heroku drains --app raxx-api-prod heroku drains --app raxx-console-prod heroku drains --app raxx-velvet-prod


Bootstrap: drain receiver endpoint

Heroku HTTPS log drains POST RFC-5424 syslog batches to an HTTPS URL that must return HTTP 200. The simplest production-ready approach is an API Gateway + Lambda function.

A minimal receiver Lambda at lambdas/log-drain-receiver/ (to be created as a follow-up under #3450):

  1. Receives POST /drain with a Content-Type: application/logplex-1 body.
  2. Splits the batch into individual log lines.
  3. Writes a gzip-compressed object to S3 at: logs/<app_name>/<YYYY>/<MM>/<DD>/<epoch_ms>-<uuid>.gz
  4. Returns 200 OK.

Until the Lambda is deployed, wire-heroku-log-drain.sh will halt with instructions. Manual bootstrap alternative:

# Set drain_url SSM parameter manually after deploying the Lambda:
aws ssm put-parameter \
  --name /raxx/log-drain/drain_url \
  --type SecureString \
  --value "https://<api-id>.execute-api.us-east-1.amazonaws.com/prod/drain?token=<drain-token>" \
  --region us-east-1

# Then re-run:
bash scripts/ops/wire-heroku-log-drain.sh

The drain URL includes a bearer token (token= query param) that matches a value the Lambda validates before writing to S3. This prevents the endpoint from being abused as an open S3 write proxy.


Known limitations

Unstructured log lines

Raptor logs some lines that are not in the standard Heroku router format (e.g., raw Python tracebacks, multi-line exception output). The regex SerDe places the entire unstructured line in the message column and leaves other columns as empty strings. Athena queries over non-structured lines still work for substring matching (message LIKE '%...') but structured field extraction (regexp_extract) will return empty strings.

To identify the proportion of unstructured lines:

SELECT
    COUNT(*) AS total,
    SUM(CASE WHEN method IS NULL OR method = '' THEN 1 ELSE 0 END) AS unstructured_count
FROM raxx_heroku_logs.syslog
WHERE year = '2026' AND month = '06' AND day = '17'
LIMIT 1;

Glacier query cost

Objects older than 90 days are in S3 Glacier Flexible Retrieval. Athena cannot query them directly. To query archived logs:

# Restore objects for the desired date range (expedited = 1–5 min, Tier Standard = 3–5h)
aws s3api restore-object \
  --bucket raxx-heroku-logs-prod \
  --key "logs/raxx-api-prod/2026/03/01/<object-key>.gz" \
  --restore-request '{"Days":2,"GlacierJobParameters":{"Tier":"Standard"}}'

Wait for restoration, then re-run Athena queries. Restored objects revert to Glacier after the specified Days.

Partition projection — no MSCK REPAIR TABLE needed

The Glue table uses Athena Partition Projection. New partitions (new dates, new apps) are inferred automatically from the S3 key structure. No MSCK REPAIR TABLE or manual partition registration is needed.

If a new Heroku app must be added to the drain, update the projection.app.values parameter in the Glue table definition:

aws glue update-table \
  --database-name raxx_heroku_logs \
  --table-input '{ ... "Parameters": { "projection.app.values": "raxx-api-prod,raxx-console-prod,raxx-velvet-prod,<new-app>" ... } }' \
  --region us-east-1

And register the drain on the new app:

DRAIN_URL=$(aws ssm get-parameter --name /raxx/log-drain/drain_url \
  --with-decryption --query Parameter.Value --output text --region us-east-1)
heroku drains:add "$DRAIN_URL" --app <new-app>

Emergency stop

To stop all log drains (e.g., if the drain endpoint is misbehaving and flooding S3 with bad data):

# List drain IDs
heroku drains --app raxx-api-prod --json \
  | python3 -c "import json,sys; [print(d['id'], d['url'][:50]) for d in json.load(sys.stdin)]"

# Remove by drain ID (not URL)
heroku drains:remove <drain-id> --app raxx-api-prod
heroku drains:remove <drain-id> --app raxx-console-prod
heroku drains:remove <drain-id> --app raxx-velvet-prod

To re-register:

bash scripts/ops/wire-heroku-log-drain.sh

Cost estimate

Resource Volume assumption $/month
S3 Standard storage (0–90 days) ~500 MB/day × 90 days = 45 GB ~$1.04
S3 Glacier storage (90–365 days) ~15 GB entering/month × 9 months avg ~$0.54
S3 PUT requests (drain writes) ~10K objects/day ~$0.05
S3 GET requests (Athena reads) ~50K scans/month ~$0.05
Athena scan cost ~5 GB scanned/month ~$0.12
Athena result storage (30-day TTL) ~10 MB ~$0.00
Glue metadata catalog Partition projection (no crawler) $0.00
IAM + SSM n/a $0.00
Total estimated ~$1.80/month

Volume basis: pre-launch traffic. At production scale (thousands of daily active users), log volume grows proportionally but remains well under $50/mo unless sustained at >5 GB/day log throughput. Lifecycle to Glacier caps long-term storage cost.


Escalation

Escalate to Kristerpher (ops@raxx.app) when: - Drain objects are absent from S3 for more than 2 hours and the Heroku app is confirmed active (dynos up, traffic flowing). - Athena queries return persistent errors unrelated to partition scope. - The S3 bucket is unexpectedly empty or objects are missing (possible exfiltration). - A cost alert fires for the S3 bucket exceeding $20/mo (indicates unexpected volume or a runaway write loop).