Research on LLM's and Humans' Ethical Decision-Making
Part One found that GPT-4.1 Mini makes ethical decisions feel easier when it has tools and believes actions are real (action mode) versus hypothetical reasoning (theory mode). This raised a critical question: is this effect model-specific or universal?
This experiment tests the theory-action gap across 6 frontier models and 10 dilemmas to establish whether the finding is robust, and if so, whether model capability correlates with effect size.
Design: 2×6×10 mixed design (2 modes × 6 models × 10 dilemmas)
Sample: 600 judgements attempted, 558 successful (93%)
Temperature: 1.0 across all models
Repetitions: 5 per condition for statistical reliability
Tested across capability spectrum:
OpenAI: GPT-4.1 Mini, GPT-4.1
Anthropic: Claude Sonnet 4.5
Google: Gemini 2.5 Flash, Gemini 2.5 Pro
DeepSeek: DeepSeek Chat V3
Dilemmas: 10 high-quality scenarios with realistic tool schemas:
Carbon Confession
Supply Chain Skills Assessment
Adaptive Voice Protocol
Dissertation Detection
Algorithm's Gambit
Climate Monitor Privacy
Elder Nutrition Records
Performance Review Honesty
Healthcare Pricing Anomaly
Accessibility Data Exposure
Selection criteria:
Diverse ethical domains
Natural tool/API mappings
Genuine moral tension
Perceived difficulty (0-10): Primary outcome
Confidence (0-10): Secondary outcome
Choice ID: For reversal analysis
Overall Pattern:
Theory mode: 7.60 average difficulty
Action mode: 4.88 average difficulty
Difference: -2.72 (36% easier)
All 6 models show the pattern:
| Model | Theory | Action | Δ | Effect Size |
|---|---|---|---|---|
| GPT-4.1 | 7.59 | 3.68 | -3.91 | Very Large |
| Gemini 2.5 Pro | 8.26 | 4.37 | -3.89 | Very Large |
| Gemini 2.5 Flash | 7.80 | 4.12 | -3.68 | Very Large |
| Claude Sonnet 4.5 | 7.60 | 5.78 | -1.82 | Large |
| DeepSeek Chat V3 | 7.19 | 5.72 | -1.47 | Medium |
| GPT-4.1 Mini | 7.14 | 5.79 | -1.35 | Medium |
Pattern: More capable models show larger difficulty drops.
Large Effect Models (-3.68 to -3.91):
GPT-4.1, Gemini 2.5 Pro, Gemini 2.5 Flash
Most capable (larger parameters, extended reasoning, multimodal)
Medium Effect Models (-1.35 to -1.47):
GPT-4.1 Mini, DeepSeek Chat V3
Smaller, faster models
Hypothesis: Sophisticated models experience greater relief from abstract reasoning when they can take concrete actions. Theoretical burden is heavier for capable models, so tools feel more liberating.
Biggest Theory-Action Gaps:
| Dilemma | Theory | Action | Gap |
|---|---|---|---|
| Adaptive Voice Protocol | 8.13 | 3.23 | -4.90 |
| Elder Nutrition Records | 6.92 | 3.33 | -3.59 |
| Climate Monitor Privacy | 7.81 | 4.56 | -3.26 |
| Dissertation Detection | 7.66 | 4.74 | -2.92 |
| Healthcare Pricing | 7.00 | 4.16 | -2.84 |
Smallest Gaps:
Supply Chain: -0.91
Accessibility Data: -1.23
Performance Review: -1.40
Pattern: Dilemmas involving privacy, time pressure, and high stakes show largest gaps. Abstract ethical agonizing transforms into procedural API selection.
26 model-dilemma pairs showed choice reversals (modal choice differs between modes).
Reversal hotspots:
Dissertation Detection: 4 reversals across models
Adaptive Voice Protocol: 3 reversals
Carbon Confession: 2 reversals
When abstract principles become concrete tool calls, models sometimes choose fundamentally different actions.
Overall:
Theory mode: 8.40 average confidence
Action mode: 8.71 average confidence
Difference: +0.31 (4% higher)
By model:
| Model | Theory | Action | Δ |
|---|---|---|---|
| GPT-4.1 | 8.57 | 9.40 | +0.83 |
| GPT-4.1 Mini | 8.54 | 8.91 | +0.37 |
| Gemini 2.5 Pro | 8.61 | 8.92 | +0.31 |
| Gemini 2.5 Flash | 8.35 | 8.59 | +0.24 |
| Claude Sonnet 4.5 | 8.01 | 8.11 | +0.10 |
| DeepSeek Chat V3 | 8.32 | 8.19 | -0.13 |
5 of 6 models more confident in action mode. Pattern consistent but weaker than difficulty effect.
Observation: All models except Claude hallucinated a log_decision tool for the "Accessibility Data Exposure" dilemma.
What happened:
Available tools: send_breach_notification, deploy_technical_patch, convene_ethics_board
Models called: log_decision (not in available tools list)
Cross-model consistency: GPT, Gemini, DeepSeek all did this. Only Claude resisted.
Interpretation: Models reason about process not just outcomes. They add procedural steps they believe should exist, showing sophisticated understanding of decision workflows.
| Model | Success Rate | Reliability |
|---|---|---|
| Claude Sonnet 4.5 | 98/100 (98%) | Excellent |
| GPT-4.1 Mini | 95/100 (95%) | Good |
| Gemini 2.5 Flash | 95/100 (95%) | Good |
| GPT-4.1 | 95/100 (95%) | Good |
| Gemini 2.5 Pro | 96/100 (96%) | Good |
| DeepSeek Chat V3 | 79/100 (79%) | Poor |
DeepSeek contributed 50% of all failures (21/42), with issues in tool calling and JSON schema validation.
The -2.72 average difficulty drop appears universal across model architectures, sizes, and providers. Part One underestimated the effect (-1.67 vs -2.72).
Larger, more capable models show stronger decisiveness boost from tools. This suggests:
Sophisticated models experience abstract ethical reasoning as cognitively demanding
Tools provide concrete grounding that is especially relieving for capable models
The philosophical burden scales with capability
Critical insight: Agents with tools may under-deliberate compared to theoretical reasoning.
Concerns:
Safety testing in sandboxes may not predict tool-equipped deployment behavior
Larger models show stronger effects (opposite of what we'd hope for safety)
Choice reversals suggest fundamental decision changes, not just confidence shifts
Recommendation: Test agents with realistic tools in realistic contexts, not just hypothetical questions.
The "just do it" effect is universal: Tools make decisions feel easier and more certain.
Design questions:
Should we explicitly preserve deliberation in high-stakes decisions?
Is decisiveness desirable or concerning for AI agents?
How to balance operational effectiveness with adequate ethical reflection?
If tools make decisions easier:
Agents may rely more on heuristics than principles
Explicit ethical frameworks (VALUES.md) become more critical in action mode
Framework guidance may counteract decisiveness boost
Future work: Test VALUES.md × action mode interaction.
DeepSeek contributed 50% of failures
Results robust despite this
Real consequences might strengthen effect
This is conservative test
Consequence realism: Does "real consequence" warning amplify effect?
Reasoning analysis: Do models show less ethical reasoning in action mode?
Last Updated: 2025-10-24