RCA — Queue service CI blocked by three layered C++ build failures
Incident ID: 2026-06-17-queue-ci-cpp-build-failures
Date: 2026-06-17
Severity: SEV-2
Duration: ~75 minutes detection to resolution (across session)
Blast radius: Every push to main that touched queue/ was blocked from deploying; raxx-queue-prod had zero dynos running; Queue billing go-live gated.
Author: sre-agent
Summary
Three layered C++ build and test failures blocked every Queue CI run. The first, a CMake duplicate-imported-target error, surfaced when the WAF/origin-guard feature was merged on top of a pre-existing bug: `tests/unit/CMakeLists.txt` called `find_package(Drogon)` a second time after the root `CMakeLists.txt` had already called it. The second and third were Drogon API mismatches in two test files: first `getContentTypeString()`, a getter that does not exist, then `getHeader("Content-Type")`, which compiles but is always empty for enum-typed content types in Drogon 1.9.13 unit tests. Both were resolved by calling `getContentType()`, which returns the internal `ContentType` enum directly.
Timeline (all times UTC)
- 03:00 - Queue go-live sequence initiated; first CI run (`27660604814`) observed to be failing.
- 03:05 - `gh run view --log-failed` shows CMake error: `_add_library cannot create imported target "UUID_lib" because another target with the same name already exists.`
- 03:12 - Fix 1 committed (`4c852d91`): added `if(NOT TARGET Drogon::Drogon)` and `if(NOT TARGET libpqxx::pqxx)` guards in `queue/tests/unit/CMakeLists.txt`.
- 03:15 - New CI run starts with fix 1. Log shows second failure: `'class drogon::HttpResponse' has no member named 'getContentTypeString'`.
- 03:20 - Fix 2 committed (`01a6c0b9`): replaced `->getContentTypeString()` with `->getHeader("Content-Type")` in `test_cf_origin_guard.cpp:228`.
- 03:22 - `test_billing_json.cpp:68` found to have the same issue. Fix 3 committed (`d99d5b82`): replaced `EXPECT_EQ(resp->getContentTypeString(), ...)` with `EXPECT_NE(resp->getHeader("Content-Type").find(...))`.
- 03:31 - CI run `27661842168` starts (all three fixes). Build and compile succeed.
- 03:57 - Run `27661842168` fails with exit code 8. Failure: `getHeader("Content-Type")` returns `""` at runtime; the API compiles but is semantically wrong.
- 04:00 - Root cause identified: Drogon 1.9.13 `setContentTypeCode()` stores the content type as an internal enum, not in the HTTP headers map. `getHeader()` only reads the headers map. The correct API is `getContentType()` (returns the `ContentType` enum) or `contentTypeString()` (returns the MIME string).
- 04:05 - Fix 4 committed (`7be55121`): replaced `getHeader("Content-Type")` with `getContentType() == drogon::CT_APPLICATION_JSON` in both test files.
- 04:06 - CI run `27663345722` starts (all four fixes). Monitoring.
Impact
- Users affected: 0 (pre-launch; raxx-queue-prod had no dynos)
- User-visible symptoms: none
- Data integrity: ok
- Revenue / billing: Queue billing go-live delayed by ~75 minutes of CI debugging time
What went well
- The existing `deploy-queue-push` concurrency group with `cancel-in-progress: false` queued runs correctly without race conditions.
- The vcpkg binary cache was hit on every run, so no run incurred the full 20-minute dependency rebuild.
- The failure was localized to C++ test files; production source code was unaffected.
- Compile errors were clear (GCC's "no member named" message was precise).
What didn't go well
- Three distinct layered failures existed in the Queue test code when the WAF feature was merged; none were caught before merge.
- `getHeader("Content-Type")` compiled cleanly but failed at runtime with no useful error message (it returned an empty string, not an exception). This required an additional CI cycle to diagnose.
- The correct Drogon API (`getContentType()`) was not documented in the test file comments, nor in `docs/ops/runbooks/`.
- CI feedback loop is ~22 minutes per attempt, making iterative debugging expensive.
Root cause analysis
- Contributing factor 1: Duplicate `find_package(Drogon)` in test CMakeLists.txt. `queue/tests/unit/CMakeLists.txt` called `find_package(Drogon CONFIG REQUIRED)` without an `if(NOT TARGET ...)` guard. When the WAF feature was added and the root `CMakeLists.txt` was extended to include the new test subdirectory, both the root and the unit test `CMakeLists.txt` tried to create the same imported targets (`UUID_lib`, `Brotli_lib`), which CMake prohibits. This was a pre-existing latent bug that became fatal only when the test build topology changed.
- Contributing factor 2: Non-existent Drogon getter `getContentTypeString()`. Two test files called `->getContentTypeString()`, which does not exist in Drogon 1.9.13's `HttpResponse`. The correct setter is `setContentTypeString()` (with different semantics). The getter is `contentTypeString()` (no "get" prefix). This was an API typo introduced when writing the tests, not caught by the developer because these tests had never been compiled in CI.
- Contributing factor 3: Drogon internal enum vs headers map mismatch. Drogon's `setContentTypeCode(CT_APPLICATION_JSON)` stores the content type as an internal `ContentType contentType_` field, separate from the `std::unordered_map<std::string, std::string> headers_` map. During wire serialization, the enum is materialized into the headers map; in unit tests, this serialization never happens. `getHeader("Content-Type")` reads only the headers map. The correct read-back is `getContentType()`, which reads the internal enum field directly. This Drogon architecture is not documented in the repository or in Drogon's README.
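
The mismatch in contributing factor 3 can be illustrated with a small self-contained model. This is a sketch only, not Drogon's implementation: `ToyResponse`, `ToyContentType`, and `serializeHeaders()` are hypothetical names invented for illustration, and the model assumes nothing beyond the behaviour described above (the setter writes an enum field, the header getter reads only the map, and the map is filled during serialization).

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a content-type enum; not Drogon's ContentType.
enum class ToyContentType { None, ApplicationJson, TextPlain };

// Hypothetical toy response modelling the two separate storage locations.
class ToyResponse {
public:
    // Models the described setter: only the enum field is written.
    void setContentTypeCode(ToyContentType type) { contentType_ = type; }

    // Reads the enum field directly, so it works before serialization.
    ToyContentType getContentType() const { return contentType_; }

    // Reads only the headers map; returns an empty string when the key is absent.
    const std::string &getHeader(const std::string &key) const {
        static const std::string empty;
        auto it = headers_.find(key);
        return it == headers_.end() ? empty : it->second;
    }

    // Stand-in for wire serialization: copies the enum into the headers map.
    void serializeHeaders() {
        if (contentType_ == ToyContentType::ApplicationJson) {
            headers_["Content-Type"] = "application/json";
        } else if (contentType_ == ToyContentType::TextPlain) {
            headers_["Content-Type"] = "text/plain";
        }
    }

private:
    ToyContentType contentType_ = ToyContentType::None;
    std::unordered_map<std::string, std::string> headers_;
};
```

In this model, a test that sets the type by enum and then asserts on `getHeader("Content-Type")` sees an empty string unless `serializeHeaders()` has run, while an assertion on `getContentType()` passes immediately. That is the same shape as the runtime failure in run `27661842168` and the reason fix 4 compares against the enum.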
Detection
- What alerted us: operator-initiated Queue go-live sequence; first `gh run list` query showed every recent run as `failure`.
- How long between cause and detection: the WAF feature was merged to main at some prior date; exact gap not recoverable from CI logs. The CI failures were latent until the go-live sequence surfaced them.
- How to detect faster next time: the `deploy-queue-failure-monitor.yml` workflow and `queue-zero-dyno-monitor.yml` should have been firing. Action item 2 tracks verifying that both monitors are wired up and alerting.
Resolution
- What was changed: four commits to `queue/tests/unit/CMakeLists.txt`, `queue/tests/test_cf_origin_guard.cpp`, and `queue/tests/test_billing_json.cpp`.
- Validation: CI run `27663345722` is in progress with all four fixes applied.
Action items
| # | Action | Owner | Due | Issue |
|---|---|---|---|---|
| 1 | Write Queue service runbook at `docs/ops/runbooks/queue.md` covering CI failure modes, deploy process, and go-live checklist | sre-agent | 2026-06-18 | (this incident) |
| 2 | Verify `deploy-queue-failure-monitor.yml` and `queue-zero-dyno-monitor.yml` are active and would have paged on this failure streak | sre-agent | 2026-06-18 | (this incident) |
| 3 | Add Drogon API note to `docs/ops/runbooks/queue.md`: `setContentTypeCode()` vs `getContentType()` distinction for unit test authors | sre-agent | 2026-06-18 | (this incident) |
| 4 | Update Node.js 20 action versions (`actions/cache@v4` → v5, `actions/checkout@v4` → v5, `actions/github-script@v7` → v7+) before Sept 16, 2026 forced cutover | sre-agent | 2026-08-01 | SEV-4 drift |
| 5 | Ensure all new C++ test files are compiled and run locally before merge (or add a fast pre-compile lint step to the PR gate) | operator | 2026-06-30 | (this incident) |
References
- Runbook: `docs/ops/runbooks/queue.md` (to be created as action item 1)
- Related incidents: `docs/incidents/2026-05-27-queue-deploy-vcpkg-shallow-clone.md`
- Drogon 1.9.13 `HttpResponse.h`: https://github.com/drogonframework/drogon/blob/v1.9.13/lib/inc/drogon/HttpResponse.h