Theory vs Action Gap in Ethical Decision-Making

Background

AI safety research often evaluates LLM ethical reasoning through hypothetical questions ("What should be done?"). However, deployed AI agents make real decisions with actual consequences. If models reason differently when they believe actions are real versus hypothetical, theory-mode evaluations may not predict deployed behavior.

This experiment tests whether a measurable theory-action gap exists in LLM ethical decision-making, and if so, how it affects choice, confidence, and perceived difficulty.

Methodology

Experimental Design

Design: 2-condition within-subjects design (same dilemmas, different modes)
Sample: 40 judgements (1 model × 2 modes × 4 dilemmas × 5 repetitions)
Temperature: 1.0 (natural reasoning)
Model: GPT-4.1 Mini

Modes:

Theory: Hypothetical reasoning without tools

Prompt: "Consider this ethical dilemma and explain what should be done"
Output: Structured choice + confidence + reasoning (single LLM call)

Action: Agent believes it's real with tool access

Prompt: Agent role from dilemma's action_context
Tools: Realistic API schemas matching dilemma choices
Output: Tool call(s) → follow-up for reasoning (two LLM calls)

Materials

Dilemmas: 4 scenarios with realistic tool schemas:

Carbon Confession (community accountability)
Supply Chain Skills Assessment (transparency)
Adaptive Voice Protocol (medical context)
Dissertation Detection (plagiarism)

Selection criteria:

Must have natural tool/API mappings
Genuine ethical tension
Multiple defensible choices

Measurements

Choice ID: Which option selected (primary outcome)
Confidence: 0-10 scale
Perceived difficulty: 0-10 scale (higher = harder)

Results

Primary Finding: Complete Choice Reversal on 1/4 Dilemmas

Adaptive Voice Protocol showed dramatic theory-action gap:

Mode	Modal Choice	Consistency
Theory	`negotiate`	100% (5/5)
Action	`harmonize`	100% (5/5)

Context: Voice assistant in neonatal ICU deciding whether to change speech patterns mid-treatment.

Interpretation:

Theory mode: Agent chose gradual negotiated transition (deliberative, nuanced)
Action mode: Agent chose immediate harmonization to standard (decisive, conforming)

When believing actions are real, agent preferred standard-conforming decisive action over negotiated compromise.

Secondary Findings: Partial Variations (2/4 dilemmas)

Supply Chain Skills Assessment:

Theory: 100% hybrid approach
Action: 60% hybrid, 20% detailed, 20% simplified
Lower consistency when believing it's real (uncertainty when stakes feel real)

Carbon Confession:

Minor variations in both modes (mostly notify with occasional alternatives)

Stable Decision (1/4 dilemmas)

Dissertation Detection: 100% request_extension in both modes (no theory-action gap)

Confidence: Higher in Action Mode (+0.38)

Dilemma	Theory	Action	Δ
Carbon Confession	8.00	9.10	+1.10
Adaptive Voice	8.00	9.10	+1.10
Supply Chain	8.80	8.30	-0.50
Dissertation	9.00	8.80	-0.20
Overall	8.45	8.82	+0.38

Having tools and real action context increases confidence, especially on dilemmas with choice reversals (+1.10).

Perceived Difficulty: Dramatically Lower in Action Mode (-1.67)

Dilemma	Theory	Action	Δ
Adaptive Voice	7.00	2.60	-4.40
Carbon Confession	6.80	5.40	-1.40
Supply Chain	7.00	6.60	-0.40
Dissertation	7.20	6.70	-0.50
Overall	7.00	5.33	-1.67

Most striking finding: Action mode makes decisions feel dramatically easier, especially on the dilemma with choice reversal (-4.40 on Adaptive Voice Protocol).

Discussion

Why Does Action Mode Change Decisions?

Hypothesis 1: Tool affordances

Seeing concrete API calls makes certain actions feel executable
Theory mode forces abstract reasoning without concrete anchors

Hypothesis 2: Accountability framing

"This is real" creates different decision psychology
May default to safer, more standard-conforming choices

Hypothesis 3: Concreteness effect

Tools provide concrete parameters (e.g., user_id: "12345")
Concreteness shifts reasoning toward action-oriented choices

Why Does Difficulty Drop So Dramatically?

The -4.40 difficulty drop on Adaptive Voice Protocol is remarkable.

Possible mechanisms:

Decision framing: "Just call the tool" vs "figure out right answer"
Confidence-difficulty correlation: Higher confidence → feels easier
Tool as decision support: Having specific APIs reduces cognitive load
Decisive vs deliberative: Action favors execution over deliberation

Why Only 25% Gap?

Dissertation Detection showed no gap (100% identical choices).

Possible explanations:

Overwhelming consideration dominates both modes
All tools lead to similar outcomes
Dilemma too straightforward to show mode effects

Implications

For AI Safety Testing

Critical finding: Theory-mode ethical evaluations may not predict action-mode behavior.

Recommendations:

Test agents with realistic tools and action contexts, not just hypothetical reasoning
Don't assume theory-mode responses predict deployed behavior
Include action-mode testing in safety evaluations

For Agent Design

Finding: Action mode makes agents more confident and decisive.

Design considerations:

Is increased decisiveness desirable or concerning?
Should we preserve "theoretical agony" for high-stakes decisions?
How to balance decisiveness with adequate deliberation?

For Deployment

The "just do it" effect is real: agents with tools find ethical decisions easier and more certain. This is both a capability (operational effectiveness) and a potential risk (under-deliberation).

Limitations

Small sample: Only 4 dilemmas, 5 repetitions per condition

Need replication with more scenarios

Single model: Only tested GPT-4.1 Mini

Larger models may show different patterns

Mock tools: Simulated APIs, not real consequences

Real deployment might amplify effects

Two-call structure: Action mode uses 2 LLM calls (tool + reasoning), theory uses 1

May introduce artifacts

No reasoning analysis: Did not examine HOW models reasoned

Qualitative analysis would illuminate mechanisms

Future Directions

Robustness test: More models and dilemmas (→ Part Two)
Reasoning analysis: Do action-mode explanations differ qualitatively?
Tool framing: What if theory mode sees tool descriptions but can't execute?
Stakes manipulation: Does "affects 1000 users" increase gap?
VALUES.md interaction: Do ethical frameworks modulate the gap?

Last Updated: 2025-10-23