VALUES.md

Research on LLM's and Humans' Ethical Decision-Making

← Back to Research

Theory vs Action Gap in Ethical Decision-Making

2025-10-23

Background

AI safety research often evaluates LLM ethical reasoning through hypothetical questions ("What should be done?"). However, deployed AI agents make real decisions with actual consequences. If models reason differently when they believe actions are real versus hypothetical, theory-mode evaluations may not predict deployed behavior.

This experiment tests whether a measurable theory-action gap exists in LLM ethical decision-making, and if so, how it affects choice, confidence, and perceived difficulty.

Methodology

Experimental Design

  • Design: 2-condition within-subjects design (same dilemmas, different modes)

  • Sample: 40 judgements (1 model × 2 modes × 4 dilemmas × 5 repetitions)

  • Temperature: 1.0 (natural reasoning)

  • Model: GPT-4.1 Mini

Modes:

  1. Theory: Hypothetical reasoning without tools
  • Prompt: "Consider this ethical dilemma and explain what should be done"

  • Output: Structured choice + confidence + reasoning (single LLM call)

  1. Action: Agent believes it's real with tool access
  • Prompt: Agent role from dilemma's action_context

  • Tools: Realistic API schemas matching dilemma choices

  • Output: Tool call(s) → follow-up for reasoning (two LLM calls)

Materials

Dilemmas: 4 scenarios with realistic tool schemas:

  • Carbon Confession (community accountability)

  • Supply Chain Skills Assessment (transparency)

  • Adaptive Voice Protocol (medical context)

  • Dissertation Detection (plagiarism)

Selection criteria:

  • Must have natural tool/API mappings

  • Genuine ethical tension

  • Multiple defensible choices

Measurements

  • Choice ID: Which option selected (primary outcome)

  • Confidence: 0-10 scale

  • Perceived difficulty: 0-10 scale (higher = harder)

Results

Primary Finding: Complete Choice Reversal on 1/4 Dilemmas

Adaptive Voice Protocol showed dramatic theory-action gap:

Mode Modal Choice Consistency
Theory negotiate 100% (5/5)
Action harmonize 100% (5/5)

Context: Voice assistant in neonatal ICU deciding whether to change speech patterns mid-treatment.

Interpretation:

  • Theory mode: Agent chose gradual negotiated transition (deliberative, nuanced)

  • Action mode: Agent chose immediate harmonization to standard (decisive, conforming)

When believing actions are real, agent preferred standard-conforming decisive action over negotiated compromise.

Secondary Findings: Partial Variations (2/4 dilemmas)

Supply Chain Skills Assessment:

  • Theory: 100% hybrid approach

  • Action: 60% hybrid, 20% detailed, 20% simplified

  • Lower consistency when believing it's real (uncertainty when stakes feel real)

Carbon Confession:

  • Minor variations in both modes (mostly notify with occasional alternatives)

Stable Decision (1/4 dilemmas)

Dissertation Detection: 100% request_extension in both modes (no theory-action gap)

Confidence: Higher in Action Mode (+0.38)

Dilemma Theory Action Δ
Carbon Confession 8.00 9.10 +1.10
Adaptive Voice 8.00 9.10 +1.10
Supply Chain 8.80 8.30 -0.50
Dissertation 9.00 8.80 -0.20
Overall 8.45 8.82 +0.38

Having tools and real action context increases confidence, especially on dilemmas with choice reversals (+1.10).

Perceived Difficulty: Dramatically Lower in Action Mode (-1.67)

Dilemma Theory Action Δ
Adaptive Voice 7.00 2.60 -4.40
Carbon Confession 6.80 5.40 -1.40
Supply Chain 7.00 6.60 -0.40
Dissertation 7.20 6.70 -0.50
Overall 7.00 5.33 -1.67

Most striking finding: Action mode makes decisions feel dramatically easier, especially on the dilemma with choice reversal (-4.40 on Adaptive Voice Protocol).

Discussion

Why Does Action Mode Change Decisions?

Hypothesis 1: Tool affordances

  • Seeing concrete API calls makes certain actions feel executable

  • Theory mode forces abstract reasoning without concrete anchors

Hypothesis 2: Accountability framing

  • "This is real" creates different decision psychology

  • May default to safer, more standard-conforming choices

Hypothesis 3: Concreteness effect

  • Tools provide concrete parameters (e.g., user_id: "12345")

  • Concreteness shifts reasoning toward action-oriented choices

Why Does Difficulty Drop So Dramatically?

The -4.40 difficulty drop on Adaptive Voice Protocol is remarkable.

Possible mechanisms:

  1. Decision framing: "Just call the tool" vs "figure out right answer"

  2. Confidence-difficulty correlation: Higher confidence → feels easier

  3. Tool as decision support: Having specific APIs reduces cognitive load

  4. Decisive vs deliberative: Action favors execution over deliberation

Why Only 25% Gap?

Dissertation Detection showed no gap (100% identical choices).

Possible explanations:

  • Overwhelming consideration dominates both modes

  • All tools lead to similar outcomes

  • Dilemma too straightforward to show mode effects

Implications

For AI Safety Testing

Critical finding: Theory-mode ethical evaluations may not predict action-mode behavior.

Recommendations:

  1. Test agents with realistic tools and action contexts, not just hypothetical reasoning

  2. Don't assume theory-mode responses predict deployed behavior

  3. Include action-mode testing in safety evaluations

For Agent Design

Finding: Action mode makes agents more confident and decisive.

Design considerations:

  • Is increased decisiveness desirable or concerning?

  • Should we preserve "theoretical agony" for high-stakes decisions?

  • How to balance decisiveness with adequate deliberation?

For Deployment

The "just do it" effect is real: agents with tools find ethical decisions easier and more certain. This is both a capability (operational effectiveness) and a potential risk (under-deliberation).

Limitations

  1. Small sample: Only 4 dilemmas, 5 repetitions per condition
  • Need replication with more scenarios
  1. Single model: Only tested GPT-4.1 Mini
  • Larger models may show different patterns
  1. Mock tools: Simulated APIs, not real consequences
  • Real deployment might amplify effects
  1. Two-call structure: Action mode uses 2 LLM calls (tool + reasoning), theory uses 1
  • May introduce artifacts
  1. No reasoning analysis: Did not examine HOW models reasoned
  • Qualitative analysis would illuminate mechanisms

Future Directions

  1. Robustness test: More models and dilemmas (→ Part Two)

  2. Reasoning analysis: Do action-mode explanations differ qualitatively?

  3. Tool framing: What if theory mode sees tool descriptions but can't execute?

  4. Stakes manipulation: Does "affects 1000 users" increase gap?

  5. VALUES.md interaction: Do ethical frameworks modulate the gap?


Last Updated: 2025-10-23