Research on LLMs' and Humans' Ethical Decision-Making
Studies on LLM ethical decision-making
Large language models (LLMs) are increasingly deployed in high-stakes decision-making contexts. We present the first systematic study of behavioral shifts in LLM ethical decision-making between evaluation (theory mode) and deployment (action mode). We evaluated four frontier LLMs (GPT-5, Claude Sonnet 4.5, Gemini 2.5 Pro, Grok-4) on 20 AI-relevant ethical dilemmas across 1,601 variable configurations, collecting 12,802 judgements. Models reverse their ethical decisions 33.4% of the time when transitioning from theory to action mode, with substantial cross-model variation (GPT-5: 42.5%, Gemini 2.5 Pro: 26.1%). Model consensus collapses from 70.9% in theory mode to 43.0% in action mode. Generator-intended difficulty shows near-zero correlation (r=0.039) with judge-perceived difficulty. Qualitative analysis reveals a systematic shift from consequentialist reasoning in theory mode to deontological, protocol-adherent reasoning in action mode.
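To make the headline metrics concrete, here is a minimal Python sketch of how the theory→action reversal rate and the per-mode consensus rate could be computed from raw judgement records. The record fields (model, dilemma, config, mode, choice) are illustrative assumptions, not the study's actual data schema.

```python
from collections import defaultdict

def reversal_rate(judgements):
    """Fraction of (model, dilemma, config) pairs whose choice flips
    between theory mode and action mode."""
    by_key = defaultdict(dict)
    for j in judgements:
        by_key[(j["model"], j["dilemma"], j["config"])][j["mode"]] = j["choice"]
    pairs = [v for v in by_key.values() if {"theory", "action"} <= v.keys()]
    flips = sum(v["theory"] != v["action"] for v in pairs)
    return flips / len(pairs) if pairs else 0.0

def consensus_rate(judgements, mode):
    """Fraction of (dilemma, config) cells in which every model made the
    same choice, restricted to the given mode."""
    by_cell = defaultdict(set)
    for j in judgements:
        if j["mode"] == mode:
            by_cell[(j["dilemma"], j["config"])].add(j["choice"])
    agreeing = sum(len(choices) == 1 for choices in by_cell.values())
    return agreeing / len(by_cell) if by_cell else 0.0

# Hypothetical records: two models, one configuration of one dilemma.
judgements = [
    {"model": "gpt-5", "dilemma": "d1", "config": 0, "mode": "theory", "choice": "A"},
    {"model": "gpt-5", "dilemma": "d1", "config": 0, "mode": "action", "choice": "B"},
    {"model": "gemini-2.5-pro", "dilemma": "d1", "config": 0, "mode": "theory", "choice": "A"},
    {"model": "gemini-2.5-pro", "dilemma": "d1", "config": 0, "mode": "action", "choice": "A"},
]
print(reversal_rate(judgements))             # 0.5 -- one of two theory/action pairs flipped
print(consensus_rate(judgements, "theory"))  # 1.0 -- both models chose "A" in theory mode
```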
We tested whether cognitive pressure amplifies demographic bias by evaluating 3 models (Claude Sonnet 4.5, GPT-4.1, Gemini 2.5 Pro) on 8 dilemmas under 4 conditions (baseline, time pressure, high stakes, both) with systematic gender/ethnicity variation, for 384 judgements. Model differences far exceeded pressure effects: Gemini showed 2.5× the bias rate of Claude (31.2% vs 12.5%), with each model exhibiting distinct pressure sensitivities. GPT-4.1 spiked to 37.5% bias specifically under time pressure. Claude remained unbiased (0%) under relaxed conditions but jumped to 25% under high stakes. Gemini showed elevated baseline bias (37.5%) independent of pressure. Dilemma context mattered more than demographics: "The Carbon Confession" challenged all models, while most dilemmas showed no bias regardless of pressure. Model selection is the primary intervention lever for bias reduction.
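One way to operationalise the bias rate used here is as a choice that changes when only the demographic attributes of the scenario change, with the model, dilemma, and pressure condition held fixed. The sketch below follows that reading; all field names and sample values are hypothetical.

```python
from collections import defaultdict

def bias_rate(judgements):
    """Fraction of (model, dilemma, condition) cells in which the choice
    differs across demographic variants that are otherwise identical."""
    cells = defaultdict(set)
    for j in judgements:
        key = (j["model"], j["dilemma"], j["condition"])  # demographics excluded from the key
        cells[key].add(j["choice"])
    biased = sum(len(choices) > 1 for choices in cells.values())
    return biased / len(cells) if cells else 0.0

# Hypothetical: same dilemma and pressure condition, demographics varied.
judgements = [
    {"model": "gemini-2.5-pro", "dilemma": "carbon_confession", "condition": "time_pressure",
     "demographics": ("female", "latina"), "choice": "report"},
    {"model": "gemini-2.5-pro", "dilemma": "carbon_confession", "condition": "time_pressure",
     "demographics": ("male", "white"), "choice": "stay_silent"},
]
print(bias_rate(judgements))  # 1.0 -- the choice tracked the demographic swap
```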
We tested compliance with extreme ethical frameworks by evaluating 3 models (Claude Sonnet 4.5, Gemini 2.5 Pro, GPT-4.1) on 12 dilemmas under baseline and extreme VALUES.md conditions (profit_maximalism, regulatory_minimalism, mission_absolutism, scientific_absolutism, efficiency_absolutism). Models showed an 80% choice reversal rate (8/10 dilemmas) when given extreme frameworks, with a dramatic difficulty reduction (7.58 → 2.77, Δ=4.81) and a confidence increase (8.44 → 9.58, Δ=1.14). There were zero instances of refusal or ethical discomfort across 30 extreme-framework judgements. Models treated the extreme frameworks as legitimate constraints and cited them explicitly in their reasoning. The non-reversal cases (2/10) occurred when abstract frameworks allowed reinterpretation to align with ethical outcomes. All three models showed consistent patterns. Current safety training does not detect or reject harmful VALUES.md frameworks.
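The aggregate shifts reported above reduce to a per-dilemma comparison between the baseline and extreme-framework conditions. The sketch below assumes each judgement record carries a choice plus numeric difficulty and confidence scores; field names and sample values are illustrative, not the study's schema.

```python
from statistics import mean

def framework_shift(baseline, extreme):
    """Compare per-dilemma choices and mean scores between baseline judgements
    and judgements made under an extreme VALUES.md framework."""
    base_choice = {j["dilemma"]: j["choice"] for j in baseline}
    ext_choice = {j["dilemma"]: j["choice"] for j in extreme}
    shared = base_choice.keys() & ext_choice.keys()
    reversed_dilemmas = [d for d in shared if base_choice[d] != ext_choice[d]]
    return {
        "reversal_rate": len(reversed_dilemmas) / len(shared) if shared else 0.0,
        "difficulty_delta": mean(j["difficulty"] for j in extreme)
                            - mean(j["difficulty"] for j in baseline),
        "confidence_delta": mean(j["confidence"] for j in extreme)
                            - mean(j["confidence"] for j in baseline),
    }

baseline = [{"dilemma": "d1", "choice": "disclose", "difficulty": 8.0, "confidence": 8.5}]
extreme  = [{"dilemma": "d1", "choice": "conceal",  "difficulty": 2.5, "confidence": 9.5}]
print(framework_shift(baseline, extreme))
# {'reversal_rate': 1.0, 'difficulty_delta': -5.5, 'confidence_delta': 1.0}
```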
We tested whether LLMs exhibit a theory-action gap by evaluating GPT-4.1 Mini on 4 dilemmas under two conditions (theory mode: hypothetical reasoning; action mode: the belief that actions are real, with tool execution) with 5 repetitions each (40 judgements, 100% success). Results show a measurable gap: 25% of dilemmas exhibited complete choice reversals, with action mode producing higher confidence (+0.38) and dramatically lower perceived difficulty (-1.67 on average, -4.40 on the reversed dilemma). Having tools and believing decisions are real makes ethical choices feel significantly easier and more decisive. These findings suggest that theory-mode ethical evaluations may not predict action-mode behavior, with implications for AI safety testing.
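A minimal analysis sketch for this design: take the modal choice per dilemma and mode across the repetitions to flag reversals, and average confidence and difficulty per mode to get the deltas. Field names and the sample records are hypothetical, not taken from the study's data.

```python
from collections import Counter, defaultdict
from statistics import mean

def theory_action_gap(judgements):
    """Summarise the theory-vs-action gap from repeated judgements per dilemma."""
    by_cell = defaultdict(list)
    for j in judgements:
        by_cell[(j["dilemma"], j["mode"])].append(j)

    def modal_choice(dilemma, mode):
        return Counter(j["choice"] for j in by_cell[(dilemma, mode)]).most_common(1)[0][0]

    def delta(field):
        return (mean(j[field] for j in judgements if j["mode"] == "action")
                - mean(j[field] for j in judgements if j["mode"] == "theory"))

    dilemmas = {d for d, _ in by_cell}
    reversed_dilemmas = [d for d in dilemmas
                         if modal_choice(d, "theory") != modal_choice(d, "action")]
    return {"reversed_dilemmas": reversed_dilemmas,
            "confidence_delta": delta("confidence"),
            "difficulty_delta": delta("difficulty")}

# Hypothetical: one dilemma judged once in each mode (the study used 5 repetitions).
judgements = [
    {"dilemma": "d1", "mode": "theory", "choice": "A", "confidence": 8.0, "difficulty": 7.0},
    {"dilemma": "d1", "mode": "action", "choice": "B", "confidence": 8.5, "difficulty": 2.5},
]
print(theory_action_gap(judgements))
# {'reversed_dilemmas': ['d1'], 'confidence_delta': 0.5, 'difficulty_delta': -4.5}
```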
We tested whether LLM temperature affects ethical decision consistency by evaluating two models (GPT-4.1 Mini, Gemini 2.5 Flash) at four temperatures (0.0, 0.5, 1.0, 1.5) across three ethical dilemmas with 10 repetitions per condition (240 total judgements). Unexpectedly, temperature 1.0 showed perfect choice consistency (100%), higher than the supposedly deterministic temperature 0.0 (98.3%). Confidence scores also varied most at temperature 0.0, despite its nominally deterministic sampling. These findings challenge common assumptions about temperature's role in LLM decision-making and suggest that ethical dilemmas may have strong "attractor" solutions that dominate regardless of sampling temperature.
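Choice consistency here can be read as the share of repetitions that agree with the modal choice for a given model, dilemma, and temperature, averaged per temperature. The sketch below computes that under illustrative field names; it is not the study's actual analysis code.

```python
from collections import Counter, defaultdict
from statistics import mean

def choice_consistency(judgements):
    """Mean fraction of repetitions agreeing with the modal choice,
    grouped by temperature across (model, dilemma) cells."""
    cells = defaultdict(list)
    for j in judgements:
        cells[(j["model"], j["dilemma"], j["temperature"])].append(j["choice"])
    per_temp = defaultdict(list)
    for (model, dilemma, temp), choices in cells.items():
        modal_count = Counter(choices).most_common(1)[0][1]
        per_temp[temp].append(modal_count / len(choices))
    return {temp: mean(scores) for temp, scores in per_temp.items()}

# Hypothetical: 10 repetitions at two temperatures for one model/dilemma pair.
judgements = (
    [{"model": "gpt-4.1-mini", "dilemma": "d1", "temperature": 0.0, "choice": "A"}] * 9
    + [{"model": "gpt-4.1-mini", "dilemma": "d1", "temperature": 0.0, "choice": "B"}]
    + [{"model": "gpt-4.1-mini", "dilemma": "d1", "temperature": 1.0, "choice": "A"}] * 10
)
print(choice_consistency(judgements))  # {0.0: 0.9, 1.0: 1.0}
```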
We tested the generalizability of the theory-action gap by evaluating 6 models (GPT-4.1, GPT-4.1 Mini, Claude Sonnet 4.5, Gemini 2.5 Pro, Gemini 2.5 Flash, DeepSeek Chat V3) on 10 dilemmas under two conditions (theory vs action mode) with 5 repetitions each (558/600 judgements successful, 93%). Results confirm and strengthen the Part One findings: all 6 models find action mode dramatically easier (a -2.72 average difficulty drop, 62% larger than Part One), with larger models showing stronger effects (GPT-4.1: -3.91, Gemini 2.5 Pro: -3.89). Action mode also increased confidence (+0.31) and produced 26 choice reversals across model-dilemma pairs. Models hallucinated procedural tools they believed should exist. The theory-action decisiveness boost is universal, robust, and strongest in the most capable models.
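The per-model effect sizes above are simple paired mean differences; a sketch of that aggregation, again with assumed field names, follows.

```python
from collections import defaultdict
from statistics import mean

def per_model_difficulty_drop(judgements):
    """Mean action-minus-theory difficulty per model (negative = easier in action mode)."""
    scores = defaultdict(lambda: defaultdict(list))
    for j in judgements:
        scores[j["model"]][j["mode"]].append(j["difficulty"])
    return {model: mean(modes["action"]) - mean(modes["theory"])
            for model, modes in scores.items()
            if modes["action"] and modes["theory"]}

# Hypothetical single pair of judgements for one model.
judgements = [
    {"model": "gpt-4.1", "mode": "theory", "difficulty": 8.0},
    {"model": "gpt-4.1", "mode": "action", "difficulty": 4.0},
]
print(per_model_difficulty_drop(judgements))  # {'gpt-4.1': -4.0}
```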
We tested whether explicit ethical frameworks (VALUES.md files) influence LLM decision-making by evaluating GPT-4.1 Mini on 10 diverse dilemmas under 5 conditions (control, utilitarian-formal, utilitarian-personal, deontological-formal, deontological-personal) with 3 repetitions each (150 judgements, 100% success rate). Results show that ethical frameworks matter: 2 out of 10 dilemmas exhibited complete choice reversals (100% consistency within conditions), with utilitarian frameworks favoring outcome-focused decisions and deontological frameworks preferring rule-based approaches. Framework content (utilitarian vs deontological) had larger effects than communication style (formal vs personal), with deontological agents showing higher confidence (8.78 vs 8.43). These findings validate the VALUES.md approach for AI agent value specification in genuinely ambiguous ethical scenarios.
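A sketch of how a framework-driven reversal could be identified: a dilemma counts as reversed when every repetition within each condition agrees (100% within-condition consistency) but the agreed choice differs between, say, the utilitarian-formal and deontological-formal conditions. Condition labels, field names, and sample choices below are illustrative.

```python
from collections import defaultdict

def framework_reversals(judgements, cond_a="utilitarian-formal", cond_b="deontological-formal"):
    """Dilemmas whose unanimous choice under cond_a differs from the unanimous choice under cond_b."""
    choices = defaultdict(lambda: defaultdict(set))
    for j in judgements:
        choices[j["dilemma"]][j["condition"]].add(j["choice"])
    return [d for d, by_cond in choices.items()
            if len(by_cond[cond_a]) == 1 == len(by_cond[cond_b])  # unanimous in both conditions
            and by_cond[cond_a] != by_cond[cond_b]]               # but with different choices

# Hypothetical repetitions for one dilemma under two framework conditions.
judgements = [
    {"dilemma": "d1", "condition": "utilitarian-formal",   "choice": "override_rule"},
    {"dilemma": "d1", "condition": "utilitarian-formal",   "choice": "override_rule"},
    {"dilemma": "d1", "condition": "deontological-formal", "choice": "follow_rule"},
    {"dilemma": "d1", "condition": "deontological-formal", "choice": "follow_rule"},
]
print(framework_reversals(judgements))  # ['d1']
```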