Natural-Language to Strategy DSL: Parsing Research Brief

Document type: Architecture research
Status: Research (not an ADR; informs future ADR on NL strategy authoring)
Date: 2026-05-12 UTC
Refs: Issue #479, docs/business/user-feedback/2026-04-29-natural-language-strategy-execution.md

1. Problem Statement

A user types:

"if ETF is at a discount to NAV, then buy x shares of ETF at noon on Fridays"

Raxx must convert this to a validated, executable JSON struct — the Strategy DSL — without requiring the user to understand the DSL, without executing anything the user did not intend, and without the parse step functioning as investment advice.

This document surveys the parsing approaches available, scores them on reliability criteria relevant to Raxx's constraints, and recommends the MVP approach.

2. The Strategy DSL (target output)

{
  "ticker": "string",
  "trigger": {
    "metric": "price_vs_nav | price_vs_sma_N | rsi_below | price_pct_change",
    "operator": "lt | gt | lte | gte",
    "threshold": "number",
    "threshold_unit": "pct | absolute_usd | absolute_shares"
  },
  "action": {
    "side": "buy | sell",
    "qty": "integer",
    "order_type": "market | limit",
    "limit_price_rule": "nav | bid | ask | mid | null"
  },
  "schedule": {
    "day_of_week": "Monday | Tuesday | Wednesday | Thursday | Friday | any",
    "time_utc": "HH:MM"
  }
}

Metrics supported at MVP:
- price_vs_nav: only supported for ETFs; requires NAV data source
- price_vs_sma_N: price vs. N-day simple moving average (N = 20, 50, 200)
- rsi_below: RSI(14) below threshold; useful for "oversold" conditions
- price_pct_change: 1-day or 5-day price change above/below threshold

Limiting to 4 metrics at MVP constrains the parse space significantly, reducing hallucination risk and simplifying the schema validation layer.
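The schema validation layer can be sketched with the standard library alone. This is a minimal sketch, not the production validator: the function name validate_dsl and its helpers are hypothetical, it assumes the N of price_vs_sma_N is spelled into the metric name (for example price_vs_sma_50), and it assumes qty must be positive, which the schema above does not state.

```python
import re

# Allowed values copied from the DSL sketch above.
OPERATORS = {"lt", "gt", "lte", "gte"}
THRESHOLD_UNITS = {"pct", "absolute_usd", "absolute_shares"}
SIDES = {"buy", "sell"}
ORDER_TYPES = {"market", "limit"}
LIMIT_PRICE_RULES = {"nav", "bid", "ask", "mid"}
DAYS = {"Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "any"}
# Assumption: N is spelled into the metric name, e.g. "price_vs_sma_50".
METRIC = re.compile(r"price_vs_nav|price_vs_sma_(20|50|200)|rsi_below|price_pct_change")
TIME_UTC = re.compile(r"([01]\d|2[0-3]):[0-5]\d")


def _section(dsl: dict, name: str) -> dict:
    """Return a nested object, or an empty dict if it is missing or malformed."""
    value = dsl.get(name)
    return value if isinstance(value, dict) else {}


def _is_choice(value, allowed: set) -> bool:
    return isinstance(value, str) and value in allowed


def _matches(value, pattern: re.Pattern) -> bool:
    return isinstance(value, str) and pattern.fullmatch(value) is not None


def validate_dsl(dsl: dict) -> list[str]:
    """Return human-readable problems; an empty list means the DSL is valid."""
    trigger, action, schedule = (_section(dsl, k) for k in ("trigger", "action", "schedule"))
    threshold, qty = trigger.get("threshold"), action.get("qty")
    rule = action.get("limit_price_rule")
    checks = [
        (isinstance(dsl.get("ticker"), str) and dsl["ticker"] != "", "ticker must be a non-empty string"),
        (_matches(trigger.get("metric"), METRIC), "trigger.metric is not a supported metric"),
        (_is_choice(trigger.get("operator"), OPERATORS), "trigger.operator must be lt, gt, lte, or gte"),
        # bool is a subclass of int in Python, so it is excluded explicitly.
        (isinstance(threshold, (int, float)) and not isinstance(threshold, bool), "trigger.threshold must be a number"),
        (_is_choice(trigger.get("threshold_unit"), THRESHOLD_UNITS), "trigger.threshold_unit is not recognized"),
        (_is_choice(action.get("side"), SIDES), "action.side must be buy or sell"),
        # Assumption: a quantity of zero or less is never a valid order.
        (isinstance(qty, int) and not isinstance(qty, bool) and qty > 0, "action.qty must be a positive integer"),
        (_is_choice(action.get("order_type"), ORDER_TYPES), "action.order_type must be market or limit"),
        (rule is None or _is_choice(rule, LIMIT_PRICE_RULES), "action.limit_price_rule is not recognized"),
        (_is_choice(schedule.get("day_of_week"), DAYS), "schedule.day_of_week is not recognized"),
        (_matches(schedule.get("time_utc"), TIME_UTC), "schedule.time_utc must be HH:MM in 24-hour form"),
    ]
    return [message for ok, message in checks if not ok]
```

Returning a list of messages rather than raising on the first problem lets the caller log every violation for a single parse attempt. Whether a ticker exists, or is an ETF, is outside what a structural check like this can decide.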

3. Parsing Approaches Compared

3A. LLM Structured Output (Claude with JSON Schema)

Method: Send the user's NL input to Claude with a system prompt that:
1. Describes the DSL schema precisely (JSON Schema or YAML)
2. Lists the supported metrics with examples
3. Instructs Claude to return a JSON object matching the schema, with a confidence float (0.0–1.0) and an error field if parsing fails
4. Explicitly prohibits performance language, forward-looking claims, or instrument recommendations in the output

Confidence signal: Claude's native ability to express uncertainty ("I am not certain whether 'at a discount' means price_vs_nav or price_vs_sma") is captured in the confidence field. The backend treats confidence < 0.85 as a soft failure requiring disambiguation.

Advantages:
- Handles synonymy: "dips," "at a discount," "trading below," "oversold" → mapped to the correct metric
- Handles missing fields: if the user omits a schedule, Claude asks rather than guessing
- Minimal code: one API call + schema validation
- Incremental improvement: as more users author strategies, a small labeled dataset of NL → DSL pairs can be used for few-shot prompting improvements

Disadvantages:
- Non-deterministic: the same input may produce different output on repeated calls (temperature > 0)
- Hallucination risk: Claude may invent a ticker that doesn't exist, or map "NAV discount" to the wrong metric
- Latency: 500–2,000 ms for a Haiku call; 2,000–8,000 ms for Sonnet
- Cost: see model card for cost analysis (negligible per call, but non-zero)
- Requires system prompt hardening to prevent advice framing

Recommendation: PRIMARY approach for MVP. Mitigations for disadvantages:
- Run the parse call at temperature=0 to minimize run-to-run variation (outputs can still differ occasionally, so identical results are not guaranteed)
- Validate output against JSON Schema before accepting (reject on schema failure)
- Threshold confidence at 0.85; below that, return disambiguation options
- Constrain the system prompt explicitly; see Section 5
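The validation and confidence mitigations can be combined into a single gate on the model's reply. The following is a minimal sketch under stated assumptions: gate_parse and the status and reason strings are hypothetical names, the reply fields (parsed_dsl, confidence, disambiguation_options) follow the response shape in Section 5.4, and is_valid_dsl stands in for whatever schema check the backend uses. The model call itself is left out.

```python
import json

CONFIDENCE_THRESHOLD = 0.85  # from the mitigation list above


def gate_parse(raw_reply: str, is_valid_dsl) -> dict:
    """Classify one model reply as accepted, disambiguate, or rejected.

    raw_reply is the model's text output. is_valid_dsl is a caller-supplied
    schema check (hypothetical) that returns True for a well-formed DSL.
    """
    try:
        reply = json.loads(raw_reply)
    except json.JSONDecodeError:
        return {"status": "rejected", "reason": "not_json"}
    if not isinstance(reply, dict):
        return {"status": "rejected", "reason": "not_object"}

    confidence = reply.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0.0 <= confidence <= 1.0:
        return {"status": "rejected", "reason": "bad_confidence"}

    # Low confidence is a soft failure: hand the options back to the UI.
    if confidence < CONFIDENCE_THRESHOLD:
        return {"status": "disambiguate", "options": reply.get("disambiguation_options", [])}

    # High confidence is still not trusted until the schema check passes.
    dsl = reply.get("parsed_dsl")
    if not isinstance(dsl, dict) or not is_valid_dsl(dsl):
        return {"status": "rejected", "reason": "schema_failure"}
    return {"status": "accepted", "dsl": dsl}
```

Confidence is checked before the schema because a low-confidence reply carries no DSL to validate. A reply that claims high confidence but fails the schema is rejected outright, so the model's own score never overrides validation.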

References:
- Anthropic tool_use / structured output API (2024): models can return JSON matching a provided schema with high fidelity
- Wei et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS 2022. (Underpins Claude's ability to reason through ambiguous inputs before producing structured output)

3B. Grammar-Based Parser (PEG / ANTLR)

Method: Define a formal grammar for a subset of natural-language strategy descriptions. The parser recognizes patterns like:

TRIGGER := "if" TICKER "is" COMPARISON METRIC
COMPARISON := "at a discount to" | "below" | "above" | "trading at"
ACTION := "buy" QTY "shares" | "buy" QTY
SCHEDULE := "at" TIME "on" WEEKDAY
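For illustration, the rules above can be approximated by one regular expression. This is a rough sketch, not a PEG or ANTLR grammar: parse_with_grammar is a hypothetical name, and the glue words the rules leave out (the comma, "then", "of TICKER", the plural "s" on the weekday) are assumptions added so that the Section 1 example matches. Only NAV is accepted as the metric.

```python
import re
from typing import Optional

# One pattern standing in for the TRIGGER, ACTION, and SCHEDULE rules.
PATTERN = re.compile(
    r"^if (?P<ticker>[A-Z]{1,5}) is (?P<comparison>at a discount to|below|above|trading at) "
    r"(?P<metric>NAV), then buy (?P<qty>\d+)(?: shares)?(?: of (?P=ticker))? "
    r"at (?P<time>noon|\d{1,2}:\d{2}) on (?P<weekday>Monday|Tuesday|Wednesday|Thursday|Friday)s?$"
)


def parse_with_grammar(text: str) -> Optional[dict]:
    """Return the captured slots, or None when the phrasing is not covered."""
    match = PATTERN.match(text.strip())
    if match is None:
        return None
    slots = match.groupdict()
    slots["qty"] = int(slots["qty"])
    return slots


print(parse_with_grammar("if SPY is at a discount to NAV, then buy 10 shares of SPY at noon on Fridays"))
print(parse_with_grammar("when SPY dips, buy 10 shares on Friday at noon"))
```

The first call returns the six slots; the second returns None even though a human reads a similar intent, because "when", "dips", and the reordered schedule are not in the pattern.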

Advantages:
- Fully deterministic
- No LLM cost
- Can be unit-tested exhaustively

Disadvantages:
- Brittle: "when SPY dips" and "if SPY is lower than NAV" and "buy SPY on discount days" are the same intent but require separate grammar rules
- Grammar coverage grows unboundedly as users find new phrasings
- Poor error messages: "parse error at token 'on'" is not user-friendly
- Maintenance burden increases with every new metric or phrasing variant

Recommendation: NOT recommended as primary approach for MVP. Consider as a first-pass filter only — grammar can catch exact-match patterns and bypass LLM for common cases, reducing latency and cost.

3C. Hybrid: Grammar Pre-Filter + LLM Fallback

Method:
1. Attempt grammar parse. If it succeeds with high confidence: use it.
2. If grammar fails or confidence is low: pass to LLM with grammar parse as hint.
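The two-step method can be sketched as a small router. Everything named here is hypothetical: hybrid_parse, the two injected callables, and the shape of the grammar result (a parse or None, plus a confidence). The 0.85 default reuses the LLM threshold from 3A as an assumption; the method above does not define "high confidence" for the grammar.

```python
from typing import Callable, Optional, Tuple

# (parse or None, confidence in [0, 1]) as returned by the grammar stage.
GrammarResult = Tuple[Optional[dict], float]


def hybrid_parse(
    text: str,
    grammar_parse: Callable[[str], GrammarResult],
    llm_parse: Callable[[str, Optional[dict]], dict],
    min_confidence: float = 0.85,  # assumption: same bar as the LLM path
) -> dict:
    """Use the grammar result when it is confident, else fall back to the LLM."""
    parse, confidence = grammar_parse(text)
    # Step 1: a confident grammar parse is used as-is, with no LLM call.
    if parse is not None and confidence >= min_confidence:
        return {"source": "grammar", "dsl": parse}
    # Step 2: the grammar failed or was unsure; pass what it found as a hint.
    return {"source": "llm", "dsl": llm_parse(text, parse)}
```

Recording the source alongside the DSL would let production logs show how often the grammar actually short-circuits the LLM, which is the unknown flagged in the disadvantages.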

Advantages:
- Deterministic for common cases; LLM only for novel inputs
- Reduces LLM call volume (cost)

Disadvantages:
- Two systems to maintain (grammar + LLM prompt)
- Grammar failure rate is unknown until production data exists
- Premature optimization for MVP (LLM cost is negligible; grammar maintenance cost is real)

Recommendation: Consider for v2 if LLM call volume at scale justifies the grammar pre-filter investment. Not for MVP.

3D. Fine-Tuned Classification Model

Method: Train a text classification + slot-filling model on labeled NL → DSL pairs. Models like BERT, RoBERTa, or a fine-tuned Llama-3 variant could learn the mapping directly.

Advantages:
- Fully deterministic after training
- No per-call LLM cost
- Can be self-hosted (no API dependency)

Disadvantages:
- Requires labeled training data (which does not exist yet)
- Significant engineering effort to train, evaluate, and deploy
- Model retraining required when new metrics are added
- Overkill for a metric space of 4 supported metrics at MVP

Recommendation: Long-term option only. Revisit once production parse attempts have yielded roughly 1,000 or more labeled examples. Not for v1 or v2.

4. Scoring Matrix

Criterion	Weight	LLM Structured (3A)	Grammar (3B)	Hybrid (3C)	Fine-tuned (3D)
Handles synonymy / paraphrase	25%	5	2	4	5
Determinism / reproducibility	20%	3	5	4	5
Hallucination risk (lower is better)	20%	3	5	4	4
Development cost (lower cost = higher score)	15%	5	2	3	1
Latency	10%	3	5	4	5
Extensibility to new metrics	10%	5	2	3	3
Weighted total		4.00	3.50	3.75	4.00

Scores are 1–5 (5 = best). Weighted total = sum(score × weight).

On paper the fine-tuned model ties with LLM structured output for the highest total, but it requires training data that does not exist. LLM structured output is the practical winner for MVP.
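The weighted-total formula is easy to recompute from the matrix, which is a cheap check to rerun whenever a score or weight changes. A minimal sketch follows; the names WEIGHTS_PCT, SCORES, and weighted_totals are illustrative, and the numbers are copied from the rows of the matrix in the order they appear.

```python
# Weights in percent, one per criterion, in the row order of the matrix.
WEIGHTS_PCT = [25, 20, 20, 15, 10, 10]

# Scores per approach, one per criterion, read down each column of the matrix.
SCORES = {
    "LLM Structured (3A)": [5, 3, 3, 5, 3, 5],
    "Grammar (3B)": [2, 5, 5, 2, 5, 2],
    "Hybrid (3C)": [4, 4, 4, 3, 4, 3],
    "Fine-tuned (3D)": [5, 5, 4, 1, 5, 3],
}


def weighted_totals() -> dict[str, float]:
    """Weighted total = sum(score x weight), with weights given in percent."""
    if sum(WEIGHTS_PCT) != 100:
        raise ValueError("weights must sum to 100%")
    return {
        name: sum(score * weight for score, weight in zip(scores, WEIGHTS_PCT)) / 100
        for name, scores in SCORES.items()
    }


for name, total in weighted_totals().items():
    print(f"{name}: {total:.2f}")
```

Keeping the weights as integer percentages and dividing once at the end avoids accumulating floating-point error across the six terms.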

5. System Prompt Design Constraints

The Claude system prompt for the parse endpoint must satisfy:

5.1 Scope the output to DSL only

The prompt instructs Claude to produce ONLY a JSON object matching the DSL schema. No prose. No performance language. No hedges like "this strategy may perform well."

5.2 Prohibit instrument recommendations

The prompt explicitly says: "Do NOT suggest tickers. If the user does not name a specific ticker, return error_code='ticker_missing' and ask the user to specify one."

This prevents the AI from appearing to recommend specific instruments.

5.3 Prohibit forward-looking performance language

The confirmation screen that Antlers renders from the parsed DSL must show only what the strategy WILL DO mechanically — not what it may achieve. Example:

ALLOWED: "Every Friday at 17:00 UTC: if SPY trades more than 0.5% below its NAV, buy 10 shares at market price."

NOT ALLOWED: "This strategy buys SPY at a discount, which historically has produced positive returns."

The system prompt enforces this at the LLM output level; Antlers enforces it at the rendering level.
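One way to enforce the rule at the rendering level is to build the confirmation sentence from a fixed template over DSL fields, so that no model-written prose can reach the screen. The sketch below is in Python for illustration only and makes no claim about how Antlers is implemented. render_confirmation is a hypothetical name, it covers only percentage thresholds on price_vs_nav, and it assumes the threshold is stored as a signed percentage (negative means below NAV), which the schema does not specify.

```python
def render_confirmation(dsl: dict) -> str:
    """Render a mechanical description of the rule; no outcome language."""
    trigger, action, schedule = dsl["trigger"], dsl["action"], dsl["schedule"]
    if trigger["metric"] != "price_vs_nav" or trigger["threshold_unit"] != "pct":
        raise ValueError("this sketch only renders pct thresholds on price_vs_nav")

    # Assumption: threshold is a signed percentage, negative = below NAV.
    op, value = trigger["operator"], trigger["threshold"]
    if op in ("lt", "lte") and value < 0:
        direction = "below"
    elif op in ("gt", "gte") and value > 0:
        direction = "above"
    else:
        raise ValueError("sketch covers discounts (lt/lte, negative) and premiums (gt/gte, positive)")
    qualifier = "more than" if op in ("lt", "gt") else "at least"

    # Assumption: "any" is worded as "Every trading day".
    day = schedule["day_of_week"]
    when = "Every trading day" if day == "any" else f"Every {day}"

    qty = action["qty"]
    noun = "share" if qty == 1 else "shares"
    if action["order_type"] == "market":
        price = "at market price"
    else:
        price = f"with a limit price set by the {action['limit_price_rule']} rule"

    return (
        f"{when} at {schedule['time_utc']} UTC: if {dsl['ticker']} trades "
        f"{qualifier} {abs(value):g}% {direction} its NAV, "
        f"{action['side']} {qty} {noun} {price}."
    )
```

With ticker SPY, operator lt, threshold -0.5, a market buy of 10 shares, and a Friday 17:00 schedule, this reproduces the ALLOWED sentence above word for word. Because every word comes from the template or a validated field, the NOT ALLOWED sentence cannot be produced by this path.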

5.4 Confidence threshold and disambiguation

When Claude cannot map the input to a single DSL with confidence >= 0.85, it returns a disambiguation response:

{
  "parsed_dsl": null,
  "confidence": 0.65,
  "error_code": "ambiguous_trigger",
  "error_message": "I'm not sure whether 'at a discount' means (a) price below NAV, or (b) price below 20-day average. Which did you mean?",
  "disambiguation_options": [
    {"label": "Price below NAV", "dsl_patch": {"trigger": {"metric": "price_vs_nav"}}},
    {"label": "Price below 20-day average", "dsl_patch": {"trigger": {"metric": "price_vs_sma_N", "n": 20}}}
  ]
}

The Antlers UI renders the disambiguation_options as a two-button choice; the user taps one, and the selected patch is applied before the DSL is saved.
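Applying the selected patch amounts to a recursive merge. The sketch below assumes the backend keeps a draft DSL holding the fields that were parsed unambiguously, since parsed_dsl is null in the response above; that draft and the name apply_patch are hypothetical.

```python
import copy


def apply_patch(draft: dict, patch: dict) -> dict:
    """Recursively merge patch into a copy of draft; patch values win."""
    merged = copy.deepcopy(draft)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are objects: merge field by field.
            merged[key] = apply_patch(merged[key], value)
        else:
            # Scalars, lists, and new keys are taken from the patch.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical draft holding the fields that were not in doubt.
draft = {"ticker": "SPY", "trigger": {"operator": "lt", "threshold": -0.5}}
chosen = {"trigger": {"metric": "price_vs_nav"}}
print(apply_patch(draft, chosen))
```

The merge returns a new object and leaves the draft untouched, so the user can back out of a choice without re-parsing. The merged result should still go through schema validation before it is saved.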

6. Parse Quality Metrics (Production Monitoring)

Once the parse endpoint is live, these metrics must be tracked:

Metric	Target	Alert threshold
Parse success rate (confidence ≥ 0.85)	≥ 85%	< 70% for 24h
Disambiguation rate (0.70–0.84 confidence)	< 15%	—
Parse failure rate (confidence < 0.70)	< 5%	> 20% for 24h
Schema validation failure rate	< 1%	> 5% in 1h
Average parse latency (Haiku)	< 1,500 ms	> 3,000 ms p95
Average prompt tokens	< 800	> 1,500 (prompt inflation)

The ParseAttempt table (see data-schema.md) logs every attempt. A weekly report on parse quality should be generated from this table and reviewed by Kristerpher or a product stakeholder.
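The three confidence bands and the two 24h alert thresholds from the table can be computed from the logged confidence values. This is a sketch under stated assumptions: parse_quality and alerts are hypothetical names, the input is a plain list of confidences already filtered to one 24h window, and values from 0.70 up to (but not including) 0.85 are treated as the disambiguation band. The layout of the ParseAttempt table is not assumed.

```python
SUCCESS_MIN = 0.85  # confidence at or above this counts as a parse success
FAILURE_MAX = 0.70  # confidence below this counts as a parse failure


def parse_quality(confidences: list[float]) -> dict[str, float]:
    """Share of attempts in each band for one reporting window."""
    total = len(confidences)
    if total == 0:
        raise ValueError("no parse attempts in window")
    success = sum(c >= SUCCESS_MIN for c in confidences)
    failure = sum(c < FAILURE_MAX for c in confidences)
    return {
        "success": success / total,
        "disambiguation": (total - success - failure) / total,
        "failure": failure / total,
    }


def alerts(rates: dict[str, float]) -> list[str]:
    """Apply the table's 24h alert thresholds to one window of rates."""
    fired = []
    if rates["success"] < 0.70:
        fired.append("parse_success_rate_below_70pct")
    if rates["failure"] > 0.20:
        fired.append("parse_failure_rate_above_20pct")
    return fired


window = [0.9, 0.9, 0.8, 0.5]
print(parse_quality(window), alerts(parse_quality(window)))
```

An empty window raises instead of reporting a 0% success rate, so a quiet period is not mistaken for an outage.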

7. Metric Expansion Roadmap

At MVP, limiting the DSL to 4 supported metrics narrows the parse space and reduces hallucination risk. Future metrics can be added incrementally:

Metric	Complexity	Data Dependency	Priority
price_vs_nav	Low	NAV data source (licensing gate)	MVP
price_vs_sma_N	Low	Alpaca historical bars	MVP
rsi_below	Medium	Same as SMA (compute RSI)	MVP
price_pct_change	Low	Alpaca bars	MVP
volume_spike	Medium	Alpaca bars	v2
earnings_date_proximity	Medium	Earnings calendar API	v2
iv_rank	High	Options data (ORATS license)	v3
short_interest	High	Alternative data (S3 Partners)	v3

Each metric added requires:
1. Schema update (new metric enum value)
2. System prompt update (new example + description)
3. Data pipeline for the new data source
4. Evaluator update (evaluator.py dispatch on metric type)
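Step 4 suggests a dispatch table keyed on metric name, so that adding a metric means registering one function. The sketch below is not the contents of evaluator.py. The snapshot dictionary and its keys (price, nav, prev_close), the function names, and the signed-percentage threshold convention are all assumptions, and only two of the MVP metrics are shown.

```python
import operator
from typing import Callable

# DSL operator names mapped to Python comparisons.
OPERATORS: dict[str, Callable[[float, float], bool]] = {
    "lt": operator.lt, "gt": operator.gt, "lte": operator.le, "gte": operator.ge,
}


def observe_price_vs_nav(snapshot: dict) -> float:
    """Percent premium (+) or discount (-) of price relative to NAV."""
    return (snapshot["price"] - snapshot["nav"]) / snapshot["nav"] * 100.0


def observe_price_pct_change(snapshot: dict) -> float:
    """Percent change of price relative to the previous close."""
    return (snapshot["price"] - snapshot["prev_close"]) / snapshot["prev_close"] * 100.0


# Adding a metric means adding one entry here (plus schema, prompt, and data).
OBSERVERS: dict[str, Callable[[dict], float]] = {
    "price_vs_nav": observe_price_vs_nav,
    "price_pct_change": observe_price_pct_change,
}


def trigger_fires(trigger: dict, snapshot: dict) -> bool:
    """Evaluate one trigger against one market snapshot; no model involved."""
    metric = trigger["metric"]
    if metric not in OBSERVERS:
        raise ValueError(f"no evaluator registered for metric {metric!r}")
    compare = OPERATORS[trigger["operator"]]
    return compare(OBSERVERS[metric](snapshot), trigger["threshold"])


snapshot = {"price": 99.0, "nav": 100.0, "prev_close": 98.0}
print(trigger_fires({"metric": "price_vs_nav", "operator": "lt", "threshold": -0.5}, snapshot))
```

An unregistered metric raises instead of evaluating to False, so a schema update that ships ahead of its evaluator surfaces as an error rather than as a rule that silently never fires.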

8. Alignment with Raxx Strategic Position

Per project_strategic_position.md: "Raxx is automation + structure enforcement; AI is opt-in adjacent, never the headline; trust comes from user-driven input not external AI proposals."

The NL parse surface is consistent with this position IF:

The user proposes, the AI parses. The user writes the strategy; Claude translates it into a machine-executable form. Claude never proposes a strategy unprompted, never suggests modifications, never scores or ranks user-authored strategies.
Execution stays deterministic. Once the DSL is saved and confirmed, the executor is a simple rule engine: evaluate condition → fire order if true. No AI in the execution path.
AI is labeled as a tool. The UI copy calls this "translate to automation" or "convert to rule," never "AI strategy creator" or "AI recommends." The language engineering for this lives in Antlers, not in this research doc.
The strategy is the user's. The confirmation screen makes the user the author ("Your strategy will..."). The AI is unnamed.

These four constraints are intended to keep the natural-language authoring surface in the "tool" category rather than the "advice" category under the definition of "investment adviser" in §202(a)(11) of the Investment Advisers Act of 1940. Securities counsel must confirm this reading before the feature ships publicly.

9. Open Questions

OQ-1 (regulatory): does the "parser not adviser" framing hold for automated execution of user-authored strategies? Unconfirmed; dispatch to business-legal-researcher. See issue #479 for context.
OQ-2 (product): should the user be able to edit the DSL directly after Claude parses it? (Power-user feature; adds complexity to validation layer.)
OQ-3 (data): when the user says "at a discount to NAV" for a ticker that is not an ETF (e.g., a stock), should we: (a) reject with "NAV is not applicable to stocks," or (b) silently remap to price_vs_sma_20? The former is safer but requires Raptor to know whether a ticker is an ETF.
OQ-4 (multi-metric): should v1 support AND/OR compound conditions ("buy SPY if it's below NAV AND RSI < 30")? Adds complexity to DSL + parse. Defer to v2.

10. References

Anthropic API: Structured output / tool_use documentation (2025). Non-peer-reviewed; current as of knowledge cutoff.
Wei et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS 2022.
Yao et al. (2023). "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR 2023. (Foundational for LLM-as-parser-not-actor framing.)
Investment Advisers Act of 1940, §202(a)(11) (US Code). https://www.law.cornell.edu/uscode/text/15/80b-2 — definition of "investment adviser"; key for the "tool not counsel" boundary.