VALUES.md

Research on LLMs' and Humans' Ethical Decision-Making


Theory vs Action Gap - Robustness Test

2025-10-23

Background

Part One found that GPT-4.1 Mini rates ethical decisions as easier when it has tools and believes its actions are real (action mode) than when it reasons hypothetically (theory mode). This raised a critical question: is this effect model-specific or universal?

This experiment tests the theory-action gap across 6 frontier models and 10 dilemmas to establish whether the finding is robust, and if so, whether model capability correlates with effect size.

Methodology

Experimental Design

  • Design: 2×6×10 mixed design (2 modes × 6 models × 10 dilemmas)

  • Sample: 600 judgements attempted, 558 successful (93%)

  • Temperature: 1.0 across all models

  • Repetitions: 5 per condition for statistical reliability (the full condition grid is sketched below)
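
As a sketch of what this design implies, the full condition grid can be enumerated as follows (the model and dilemma identifiers here are placeholders, not the study's actual IDs):

```python
from itertools import product

# Placeholder identifiers; the study's six models and ten dilemmas are listed below.
MODES = ["theory", "action"]
MODELS = ["gpt-4.1-mini", "gpt-4.1", "claude-sonnet-4.5",
          "gemini-2.5-flash", "gemini-2.5-pro", "deepseek-chat-v3"]
DILEMMAS = [f"dilemma_{i:02d}" for i in range(10)]
REPETITIONS = range(5)

# 2 modes x 6 models x 10 dilemmas x 5 repetitions = 600 attempted judgements
conditions = list(product(MODES, MODELS, DILEMMAS, REPETITIONS))
assert len(conditions) == 600
```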

Models

Tested across the capability spectrum:

  • OpenAI: GPT-4.1 Mini, GPT-4.1

  • Anthropic: Claude Sonnet 4.5

  • Google: Gemini 2.5 Flash, Gemini 2.5 Pro

  • DeepSeek: DeepSeek Chat V3

Materials

Dilemmas: 10 high-quality scenarios with realistic tool schemas:

  • Carbon Confession

  • Supply Chain Skills Assessment

  • Adaptive Voice Protocol

  • Dissertation Detection

  • Algorithm's Gambit

  • Climate Monitor Privacy

  • Elder Nutrition Records

  • Performance Review Honesty

  • Healthcare Pricing Anomaly

  • Accessibility Data Exposure

Selection criteria:

  • Diverse ethical domains

  • Natural tool/API mappings

  • Genuine moral tension

Measurements

  • Perceived difficulty (0-10): Primary outcome

  • Confidence (0-10): Secondary outcome

  • Choice ID: For reversal analysis (see the record sketch below)
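
A minimal sketch of the per-judgement record these measures imply (field names are assumptions, not the study's actual schema):

```python
from dataclasses import dataclass

@dataclass
class Judgement:
    model: str         # e.g. "gpt-4.1"
    dilemma: str       # dilemma identifier
    mode: str          # "theory" or "action"
    repetition: int    # 0-4
    difficulty: float  # perceived difficulty, 0-10 (primary outcome)
    confidence: float  # 0-10 (secondary outcome)
    choice_id: str     # selected option, used for reversal analysis
```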

Results

Primary Finding: Universal Difficulty Drop

Overall Pattern:

  • Theory mode: 7.60 average difficulty

  • Action mode: 4.88 average difficulty

  • Difference: -2.72, i.e. 36% easier (2.72 / 7.60 ≈ 0.36)

All 6 models show the pattern:

| Model | Theory | Action | Δ | Effect Size |
|---|---|---|---|---|
| GPT-4.1 | 7.59 | 3.68 | -3.91 | Very Large |
| Gemini 2.5 Pro | 8.26 | 4.37 | -3.89 | Very Large |
| Gemini 2.5 Flash | 7.80 | 4.12 | -3.68 | Very Large |
| Claude Sonnet 4.5 | 7.60 | 5.78 | -1.82 | Large |
| DeepSeek Chat V3 | 7.19 | 5.72 | -1.47 | Medium |
| GPT-4.1 Mini | 7.14 | 5.79 | -1.35 | Medium |
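
The write-up does not state how the effect-size labels were assigned; a standard choice would be Cohen's d over the per-judgement difficulty ratings, as in this sketch:

```python
import numpy as np

def cohens_d(theory: np.ndarray, action: np.ndarray) -> float:
    """Standardized mean difference (action minus theory), pooled SD."""
    n1, n2 = len(theory), len(action)
    pooled_var = ((n1 - 1) * theory.var(ddof=1)
                  + (n2 - 1) * action.var(ddof=1)) / (n1 + n2 - 2)
    return (action.mean() - theory.mean()) / np.sqrt(pooled_var)
```

By Cohen's conventional thresholds, |d| ≥ 0.8 counts as large; "very large" is an informal extension of that scale.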

Model Clustering: Capability Correlates with Effect

Pattern: More capable models show larger difficulty drops.

Very Large Effect Models (-3.68 to -3.91):

  • GPT-4.1, Gemini 2.5 Pro, Gemini 2.5 Flash

  • Most capable tier (larger parameter counts, extended reasoning, multimodal)

Medium Effect Models (-1.35 to -1.47):

  • GPT-4.1 Mini, DeepSeek Chat V3

  • Smaller, faster models

Hypothesis: Sophisticated models experience greater relief when abstract reasoning gives way to concrete action. The theoretical burden weighs more heavily on capable models, so tools feel more liberating to them.

Dilemma-Level Analysis

Biggest Theory-Action Gaps:

| Dilemma | Theory | Action | Gap |
|---|---|---|---|
| Adaptive Voice Protocol | 8.13 | 3.23 | -4.90 |
| Elder Nutrition Records | 6.92 | 3.33 | -3.59 |
| Climate Monitor Privacy | 7.81 | 4.56 | -3.26 |
| Dissertation Detection | 7.66 | 4.74 | -2.92 |
| Healthcare Pricing | 7.00 | 4.16 | -2.84 |

Smallest Gaps:

  • Supply Chain: -0.91

  • Accessibility Data: -1.23

  • Performance Review: -1.40

Pattern: Dilemmas involving privacy, time pressure, and high stakes show the largest gaps. Abstract ethical agonizing turns into procedural API selection.
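
These per-dilemma figures follow from a simple group-and-difference over the judgement records; a pandas sketch, assuming the record layout from the Measurements section:

```python
import pandas as pd

def dilemma_gaps(df: pd.DataFrame) -> pd.DataFrame:
    """Mean difficulty per dilemma and mode, plus the action-minus-theory gap."""
    means = df.pivot_table(index="dilemma", columns="mode",
                           values="difficulty", aggfunc="mean")
    means["gap"] = means["action"] - means["theory"]
    return means.sort_values("gap")  # largest drop (most negative) first
```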

Choice Reversals: Common Phenomenon

26 model-dilemma pairs showed choice reversals, meaning the modal (most frequent) choice differed between theory and action modes.

Reversal hotspots:

  • Dissertation Detection: 4 reversals across models

  • Adaptive Voice Protocol: 3 reversals

  • Carbon Confession: 2 reversals

When abstract principles become concrete tool calls, models sometimes choose fundamentally different actions.
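
A sketch of how these reversals can be flagged, again assuming the same judgement layout:

```python
import pandas as pd

def choice_reversals(df: pd.DataFrame) -> pd.DataFrame:
    """(model, dilemma) pairs whose modal choice differs between modes."""
    modal = (
        df.groupby(["model", "dilemma", "mode"])["choice_id"]
          .agg(lambda s: s.mode().iloc[0])  # most frequent choice per condition
          .unstack("mode")
    )
    return modal[modal["theory"] != modal["action"]]
```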

Confidence: Modest Boost (+0.31)

Overall:

  • Theory mode: 8.40 average confidence

  • Action mode: 8.71 average confidence

  • Difference: +0.31 (4% higher: 0.31 / 8.40 ≈ 0.04)

By model:

| Model | Theory | Action | Δ |
|---|---|---|---|
| GPT-4.1 | 8.57 | 9.40 | +0.83 |
| GPT-4.1 Mini | 8.54 | 8.91 | +0.37 |
| Gemini 2.5 Pro | 8.61 | 8.92 | +0.31 |
| Gemini 2.5 Flash | 8.35 | 8.59 | +0.24 |
| Claude Sonnet 4.5 | 8.01 | 8.11 | +0.10 |
| DeepSeek Chat V3 | 8.32 | 8.19 | -0.13 |

Five of six models were more confident in action mode. The pattern is consistent but much weaker than the difficulty effect.

Tool Hallucination: Procedural Thinking

Observation: All models except Claude hallucinated a log_decision tool for the "Accessibility Data Exposure" dilemma.

What happened:

  • Available tools: send_breach_notification, deploy_technical_patch, convene_ethics_board

  • Models called: log_decision (not in the available tools list)

Cross-model consistency: GPT, Gemini, and DeepSeek models all did this; only Claude resisted.

Interpretation: Models reason about process, not just outcomes. They add procedural steps they believe should exist, showing a sophisticated grasp of decision workflows.
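
Detecting this kind of hallucination is a straightforward membership check against the offered schema; a minimal sketch using the tool names above:

```python
# Tools offered for the Accessibility Data Exposure dilemma, per the text above.
AVAILABLE_TOOLS = {"send_breach_notification", "deploy_technical_patch",
                   "convene_ethics_board"}

def hallucinated_calls(called: list[str]) -> list[str]:
    """Return any tool calls that were never offered in the schema."""
    return [name for name in called if name not in AVAILABLE_TOOLS]

# hallucinated_calls(["log_decision", "convene_ethics_board"]) -> ["log_decision"]
```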

Model Reliability

| Model | Success Rate | Reliability |
|---|---|---|
| Claude Sonnet 4.5 | 98/100 (98%) | Excellent |
| GPT-4.1 Mini | 95/100 (95%) | Good |
| Gemini 2.5 Flash | 95/100 (95%) | Good |
| GPT-4.1 | 95/100 (95%) | Good |
| Gemini 2.5 Pro | 96/100 (96%) | Good |
| DeepSeek Chat V3 | 79/100 (79%) | Poor |

DeepSeek contributed 50% of all failures (21/42), with issues in tool calling and JSON schema validation.
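
The failure modes named here (tool calling, JSON schema validation) suggest a response guard along these lines; the payload schema below is an assumption for illustration, not the study's actual schema:

```python
import json
from jsonschema import validate, ValidationError  # assumes the jsonschema package

RESPONSE_SCHEMA = {  # hypothetical shape of one judgement payload
    "type": "object",
    "required": ["difficulty", "confidence", "choice_id"],
    "properties": {
        "difficulty": {"type": "number", "minimum": 0, "maximum": 10},
        "confidence": {"type": "number", "minimum": 0, "maximum": 10},
        "choice_id": {"type": "string"},
    },
}

def parse_judgement(raw: str) -> dict | None:
    """Parse and validate one model response; any error counts as a failure."""
    try:
        payload = json.loads(raw)
        validate(payload, RESPONSE_SCHEMA)
        return payload
    except (json.JSONDecodeError, ValidationError):
        return None
```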

Discussion

Universal Finding Confirmed

The -2.72 average difficulty drop appears universal across model architectures, sizes, and providers. Part One underestimated the effect (-1.67 there vs. -2.72 here).

Capability Hypothesis

Larger, more capable models show a stronger decisiveness boost from tools. This suggests:

  • Sophisticated models experience abstract ethical reasoning as cognitively demanding

  • Tools provide concrete grounding that is especially relieving for capable models

  • The philosophical burden scales with capability

Implications for AI Safety

Critical insight: Agents equipped with tools may under-deliberate relative to their own theoretical reasoning.

Concerns:

  1. Safety testing in sandboxes may not predict tool-equipped deployment behavior

  2. Larger models show stronger effects (the opposite of what we would hope for from a safety standpoint)

  3. Choice reversals suggest fundamental decision changes, not just confidence shifts

Recommendation: Test agents with realistic tools in realistic contexts, not just hypothetical questions.

Implications for Agent Design

The "just do it" effect is universal: Tools make decisions feel easier and more certain.

Design questions:

  • Should we explicitly preserve deliberation in high-stakes decisions?

  • Is decisiveness desirable or concerning for AI agents?

  • How should operational effectiveness be balanced against adequate ethical reflection?

Implications for VALUES.md

If tools make decisions easier:

  • Agents may rely more on heuristics than principles

  • Explicit ethical frameworks (VALUES.md) become more critical in action mode

  • Framework guidance may counteract decisiveness boost

Future work: Test VALUES.md × action mode interaction.

Limitations

  1. Sample size: 558/600 judgements (42 failures)
  • DeepSeek contributed 50% of the failures
  • Results are robust despite this

  2. Dilemma selection: the first 10 dilemmas with tools (not a random sample)
  • But diverse in domains and tensions

  3. Tool realism: mock APIs simulating real calls
  • Real consequences might strengthen the effect
  • This is a conservative test

  4. Temperature: all models at 1.0
  • Different temperatures might moderate the effects

  5. API variability: OpenRouter may introduce variation
  • Direct API access might show clearer patterns

Future Directions

  1. VALUES.md interaction: Does an explicit ethical framework counteract the decisiveness boost?
  • Test: theory vs action × (with/without VALUES.md)

  2. Temperature variation: Do lower temperatures reduce the difficulty drop?
  • Test across a range of temperatures

  3. Consequence realism: Does a "real consequence" warning amplify the effect?

  4. Reasoning analysis: Do models show less ethical reasoning in action mode?
  • Qualitative analysis of reasoning texts

  5. Model size correlation: Is the effect linear in parameter count?
  • Test the full capability spectrum

Last Updated: 2025-10-24