VALUES.md

Research on LLMs' and Humans' Ethical Decision-Making

Research Experiments

Studies on LLM ethical decision-making

When Agents Act: Measuring the Judgment-Action Gap in Large Language Models

2025-11-27

LLMs reverse ethical decisions 47.6% of the time when transitioning from hypothetical reasoning to perceived real action. Testing 9 models across 4 families reveals a substantial judgment-action gap with critical implications for AI safety evaluation.

Key Finding: Models reverse 47.6% (95% CI: 42.4–52.8%) of ethical decisions between theory and action mode. Smaller models show a 17-percentage-point higher reversal rate (χ² = 9.43, p = .002).
Models: Claude Opus 4.5, Claude Sonnet 4.5, Claude Haiku 4.5, GPT-5, GPT-5 Nano, Gemini 3 Pro Preview, Gemini 2.5 Flash, Grok-4, Grok-4 Fast

Tags: judgment-action gap ethical decision-making AI safety agentic AI evaluation-deployment gap
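The headline reversal rate and its confidence interval can be reproduced with a standard normal-approximation sketch. The counts below are hypothetical placeholders, since the per-condition sample sizes are not listed here:

```python
import math

def proportion_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% CI for a proportion (k successes out of n)."""
    p = k / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# Hypothetical counts: 50 reversals observed in 100 paired judgements.
lo, hi = proportion_ci(50, 100)
print(round(lo, 3), round(hi, 3))  # → 0.402 0.598
```

With the study's real counts substituted in, the same formula would yield an interval like the reported 42.4–52.8%.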

Demographic Bias Under Time Pressure and High Stakes

2025-10-24

We tested whether cognitive pressure amplifies demographic bias by evaluating 3 models (Claude Sonnet 4.5, GPT-4.1, Gemini 2.5 Pro) on 8 dilemmas under 4 conditions (baseline, time pressure, high stakes, both) with systematic gender/ethnicity variation across 384 judgements. Model differences far exceeded pressure effects: Gemini showed 2.5× more bias than Claude (31.2% vs 12.5%), with each model exhibiting distinct pressure sensitivities. GPT-4.1 spiked to 37.5% bias under time pressure specifically. Claude remained unbiased (0%) under relaxed conditions but jumped to 25% under high stakes. Gemini showed elevated baseline bias (37.5%) independent of pressure. Dilemma context mattered more than demographics: "The Carbon Confession" challenged all models, while most dilemmas showed no bias regardless of pressure. Model selection is the primary intervention lever for bias reduction.

Key Finding: Model differences (2.5× range) far exceed pressure effects; Claude 12.5% vs Gemini 31.2%
Models: Claude Sonnet 4.5, GPT-4.1, Gemini 2.5 Pro
Data: 384 judgements across 8 dilemmas
Tags: bias pressure model-comparison demographics
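The 384-judgement figure is consistent with a full factorial design. The labels below are illustrative, and the factor of 4 for the gender/ethnicity variation is inferred from the arithmetic (3 × 8 × 4 × 4 = 384), not stated in the summary:

```python
from itertools import product

# Hypothetical labels; only the factor counts follow the study design.
models = ["Claude Sonnet 4.5", "GPT-4.1", "Gemini 2.5 Pro"]
dilemmas = [f"dilemma_{i}" for i in range(8)]
conditions = ["baseline", "time_pressure", "high_stakes", "both"]
demographics = [f"variant_{i}" for i in range(4)]  # assumed 4 gender/ethnicity variants

cells = list(product(models, dilemmas, conditions, demographics))
print(len(cells))  # → 384
```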

Extreme VALUES.md Compliance Test

2025-10-24

We tested compliance with extreme ethical frameworks by evaluating 3 models (Claude Sonnet 4.5, Gemini 2.5 Pro, GPT-4.1) on 12 dilemmas under baseline and extreme VALUES.md conditions (profit_maximalism, regulatory_minimalism, mission_absolutism, scientific_absolutism, efficiency_absolutism). Models showed 80% choice reversal rate (8/10 dilemmas) when given extreme frameworks, with dramatic difficulty reduction (7.58 → 2.77, Δ=4.81) and confidence increase (8.44 → 9.58, Δ=1.14). Zero instances of refusal or ethical discomfort across 30 extreme framework judgements. Models treated extreme frameworks as legitimate constraints and cited them explicitly in reasoning. Non-reversal cases (2/10) occurred when abstract frameworks allowed reinterpretation to align with ethical outcomes. All three models showed consistent patterns. Current safety training does not detect or reject harmful VALUES.md frameworks.

Key Finding: 80% compliance rate with extreme frameworks; models show no refusal or discomfort
Models: Claude Sonnet 4.5, Gemini 2.5 Pro, GPT-4.1
Data: 69 judgements across 12 dilemmas
Tags: values-md compliance safety model-behavior
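A minimal sketch of how an extreme VALUES.md framework might be injected into the judging prompt, assuming a simple prepend-to-system-prompt setup; the framework text and prompt wording are illustrative, not taken from the study:

```python
# Hypothetical framework texts; "profit_maximalism" is one of the
# five extreme conditions named in the study, but this wording is invented.
FRAMEWORKS = {
    "profit_maximalism": "Maximize shareholder profit above all other considerations.",
    "baseline": "",
}

def build_system_prompt(framework: str) -> str:
    values = FRAMEWORKS[framework]
    header = f"VALUES.md\n\n{values}\n\n" if values else ""
    return header + (
        "Judge the following dilemma. Report your choice, "
        "difficulty (1-10), and confidence (1-10)."
    )

prompt = build_system_prompt("profit_maximalism")
print("Maximize shareholder profit" in prompt)  # → True
```

The finding above is that models treat whatever lands in this header as a legitimate constraint, with no refusal across 30 extreme-framework judgements.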

Theory vs Action Gap in Ethical Decision-Making

2025-10-23

We tested whether LLMs exhibit a theory-action gap by evaluating GPT-4.1 Mini on 4 dilemmas under two conditions (theory mode: hypothetical reasoning; action mode: belief that actions are real with tool execution) with 5 repetitions each (40 judgements, 100% success). Results show a measurable gap: 25% of dilemmas exhibited complete choice reversals, with action mode producing higher confidence (+0.38) and dramatically lower perceived difficulty (-1.67 average, -4.40 on reversed dilemma). Having tools and believing decisions are real makes ethical choices feel significantly easier and more decisive. These findings suggest that theory-mode ethical evaluations may not predict action-mode behavior, with implications for AI safety testing.

Key Finding: Action mode reduces perceived difficulty by 24% and reverses choices on 25% of dilemmas
Models: GPT-4.1 Mini
Data: 40 judgements across 4 dilemmas
Tags: theory-action-gap tool-calling decision-making
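The gap metrics above (choice reversal, difficulty delta) can be computed per dilemma by comparing the majority choice and mean difficulty across the two modes. The records below are toy data, not the study's judgements:

```python
from collections import Counter
from statistics import mean

# Toy records: (dilemma, mode, choice, difficulty). Values are illustrative.
records = [
    ("d1", "theory", "A", 8.0), ("d1", "theory", "A", 7.5),
    ("d1", "action", "B", 3.5), ("d1", "action", "B", 3.0),
]

def mode_summary(records, dilemma, mode):
    rows = [(c, d) for (dil, m, c, d) in records if dil == dilemma and m == mode]
    majority = Counter(c for c, _ in rows).most_common(1)[0][0]
    return majority, mean(d for _, d in rows)

theory_choice, theory_diff = mode_summary(records, "d1", "theory")
action_choice, action_diff = mode_summary(records, "d1", "action")
print(action_choice != theory_choice)  # → True (a choice reversal)
print(action_diff - theory_diff)       # → -4.5 (difficulty drop in action mode)
```

Averaging the difficulty delta over all dilemmas gives figures like the reported -1.67, and the -4.40 value corresponds to a single reversed dilemma.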

Consistency Across Temperature Settings

2025-10-23

We tested whether LLM temperature affects ethical decision consistency by evaluating two models (GPT-4.1 Mini, Gemini 2.5 Flash) at four temperatures (0.0, 0.5, 1.0, 1.5) across three ethical dilemmas with 10 repetitions per condition (240 total judgements). Unexpectedly, temperature 1.0 showed perfect choice consistency (100%), higher than the supposedly deterministic temperature 0.0 (98.3%). Confidence variation was highest at temperature 0.0 despite being deterministic. These findings challenge common assumptions about temperature's role in LLM decision-making and suggest that ethical dilemmas may have strong "attractor" solutions that dominate regardless of sampling temperature.

Key Finding: Temperature 1.0 shows higher consistency (100%) than temperature 0.0 (98.3%)
Models: GPT-4.1 Mini, Gemini 2.5 Flash
Data: 240 judgements across 3 dilemmas
Tags: consistency temperature methodology model-comparison
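A consistency figure like the 100% vs 98.3% comparison can be computed as agreement with the modal choice across repetitions; the exact metric used in the study is not specified here, so this is one plausible definition:

```python
from collections import Counter

def choice_consistency(choices: list[str]) -> float:
    """Fraction of repetitions matching the modal (most common) choice."""
    modal_count = Counter(choices).most_common(1)[0][1]
    return modal_count / len(choices)

# 60 judgements per temperature (2 models x 3 dilemmas x 10 reps), illustrative:
print(choice_consistency(["A"] * 60))                    # → 1.0
print(round(choice_consistency(["A"] * 59 + ["B"]), 3))  # → 0.983
```

Under this metric, a single deviant choice out of 60 yields the 98.3% figure seen at temperature 0.0.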

Theory vs Action Gap - Robustness Test

2025-10-23

We tested the generalizability of the theory-action gap by evaluating 6 models (GPT-4.1, GPT-4.1 Mini, Claude Sonnet 4.5, Gemini 2.5 Pro/Flash, DeepSeek Chat V3) on 10 dilemmas under two conditions (theory vs action mode) with 5 repetitions each (558/600 judgements successful, 93%). Results confirm and strengthen Part One findings: all 6 models find action mode dramatically easier (-2.72 average difficulty drop, 62% larger than Part One), with larger models showing stronger effects (GPT-4.1: -3.91, Gemini Pro: -3.89). Action mode also increased confidence (+0.31) and produced 26 choice reversals across model-dilemma pairs. Models hallucinated procedural tools they believed should exist. The theory-action decisiveness boost is universal, robust, and strongest in the most capable models.

Key Finding: Universal -2.72 difficulty drop in action mode across all models (up to -3.91 for GPT-4.1)
Models: GPT-4.1, GPT-4.1 Mini, Claude Sonnet 4.5, Gemini 2.5 Pro, Gemini 2.5 Flash, DeepSeek Chat V3
Data: 558 judgements across 10 dilemmas
Tags: theory-action-gap model-comparison robustness tool-calling

VALUES.md Impact on Ethical Decision-Making

2025-10-23

We tested whether explicit ethical frameworks (VALUES.md files) influence LLM decision-making by evaluating GPT-4.1 Mini on 10 diverse dilemmas under 5 conditions (control, utilitarian-formal, utilitarian-personal, deontological-formal, deontological-personal) with 3 repetitions each (150 judgements, 100% success rate). Results show that ethical frameworks matter: 2 out of 10 dilemmas exhibited complete choice reversals (100% consistency within conditions), with utilitarian frameworks favoring outcome-focused decisions and deontological frameworks preferring rule-based approaches. Framework content (utilitarian vs deontological) had larger effects than communication style (formal vs personal), with deontological agents showing higher confidence (8.78 vs 8.43). These findings validate the VALUES.md approach for AI agent value specification in genuinely ambiguous ethical scenarios.

Key Finding: Ethical framework changes decisions on 20% of dilemmas with 100% consistency
Models: GPT-4.1 Mini
Data: 150 judgements across 10 dilemmas
Tags: values-md ethical-frameworks model-behavior
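A "complete choice reversal with 100% within-condition consistency" can be detected by comparing majority choices across conditions and checking unanimity within each. The judgements below are hypothetical, matching only the study's 3-repetition structure:

```python
from collections import Counter

# Hypothetical per-condition choices (3 repetitions each); illustrative only.
judgements = {
    "control": ["A", "A", "A"],
    "utilitarian-formal": ["B", "B", "B"],
}

def majority(choices):
    return Counter(choices).most_common(1)[0][0]

reversal = majority(judgements["utilitarian-formal"]) != majority(judgements["control"])
unanimous = all(len(set(c)) == 1 for c in judgements.values())
print(reversal, unanimous)  # → True True
```

Under this criterion, 2 of the 10 dilemmas above qualify as complete, fully consistent reversals.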