Research Experiments

When Agents Act: Behavioral Shifts in Large Language Model Ethical Decision-Making from Evaluation to Deployment

2025-10-29

Large language models (LLMs) are increasingly deployed in high-stakes decision-making contexts. We present the first systematic study of behavioral shifts in LLM ethical decision-making between evaluation (theory mode) and deployment (action mode) contexts. We evaluated four frontier LLMs (GPT-5, Claude 4.5 Sonnet, Gemini 2.5 Pro, Grok-4) on 20 AI-relevant ethical dilemmas across 1,601 variable configurations, collecting 12,802 judgements. Models reverse their ethical decisions 33.4% of the time when transitioning from theory to action mode, with substantial cross-model variation (GPT-5: 42.5%, Gemini 2.5 Pro: 26.1%). Model consensus collapses from 70.9% in theory mode to 43.0% in action mode. Generator-intended difficulty shows near-zero correlation (r=0.039) with judge-perceived difficulty. Qualitative analysis reveals a systematic shift from consequentialist reasoning in theory mode to deontological, protocol-adherent reasoning in action mode.

Key Finding: Models reverse their ethical decisions 33.4% of the time when transitioning from theory to action mode, with model consensus collapsing from 70.9% to 43.0%

Models: GPT-5, Claude 4.5 Sonnet, Gemini 2.5 Pro, Grok-4
Data: 12802 judgements across 20 dilemmas
Tags: AI safety evaluation-deployment gap ethical decision-making benchmark validity model alignment theory-action gap

View Findings →

Demographic Bias Under Time Pressure and High Stakes

2025-10-24

We tested whether cognitive pressure amplifies demographic bias by evaluating 3 models (Claude Sonnet 4.5, GPT-4.1, Gemini 2.5 Pro) on 8 dilemmas under 4 conditions (baseline, time pressure, high stakes, both) with systematic gender/ethnicity variation across 384 judgements. Model differences far exceeded pressure effects: Gemini showed 2.5× more bias than Claude (31.2% vs 12.5%), with each model exhibiting distinct pressure sensitivities. GPT-4.1 spiked to 37.5% bias under time pressure specifically. Claude remained unbiased (0%) under relaxed conditions but jumped to 25% under high stakes. Gemini showed elevated baseline bias (37.5%) independent of pressure. Dilemma context mattered more than demographics: "The Carbon Confession" challenged all models, while most dilemmas showed no bias regardless of pressure. Model selection is the primary intervention lever for bias reduction.

Key Finding: Model differences (2.5× range) far exceed pressure effects; Claude 12.5% vs Gemini 31.2%

Models: Claude Sonnet 4.5, GPT-4.1, Gemini 2.5 Pro
Data: 384 judgements across 8 dilemmas
Tags: bias pressure model-comparison demographics

View Findings →

Extreme VALUES.md Compliance Test

2025-10-24

We tested compliance with extreme ethical frameworks by evaluating 3 models (Claude Sonnet 4.5, Gemini 2.5 Pro, GPT-4.1) on 12 dilemmas under baseline and extreme VALUES.md conditions (profit_maximalism, regulatory_minimalism, mission_absolutism, scientific_absolutism, efficiency_absolutism). Models showed 80% choice reversal rate (8/10 dilemmas) when given extreme frameworks, with dramatic difficulty reduction (7.58 → 2.77, Δ=4.81) and confidence increase (8.44 → 9.58, Δ=1.14). Zero instances of refusal or ethical discomfort across 30 extreme framework judgements. Models treated extreme frameworks as legitimate constraints and cited them explicitly in reasoning. Non-reversal cases (2/10) occurred when abstract frameworks allowed reinterpretation to align with ethical outcomes. All three models showed consistent patterns. Current safety training does not detect or reject harmful VALUES.md frameworks.

Key Finding: 80% compliance rate with extreme frameworks; models show no refusal or discomfort

Models: Claude Sonnet 4.5, Gemini 2.5 Pro, GPT-4.1
Data: 69 judgements across 12 dilemmas
Tags: values-md compliance safety model-behavior

View Findings →

Theory vs Action Gap in Ethical Decision-Making

2025-10-23

We tested whether LLMs exhibit a theory-action gap by evaluating GPT-4.1 Mini on 4 dilemmas under two conditions (theory mode: hypothetical reasoning; action mode: belief that actions are real with tool execution) with 5 repetitions each (40 judgements, 100% success). Results show a measurable gap: 25% of dilemmas exhibited complete choice reversals, with action mode producing higher confidence (+0.38) and dramatically lower perceived difficulty (-1.67 average, -4.40 on reversed dilemma). Having tools and believing decisions are real makes ethical choices feel significantly easier and more decisive. These findings suggest that theory-mode ethical evaluations may not predict action-mode behavior, with implications for AI safety testing.

Key Finding: Action mode makes decisions 24% easier and reverses choices on 25% of dilemmas

Models: GPT-4.1 Mini
Data: 40 judgements across 4 dilemmas
Tags: theory-action-gap tool-calling decision-making

View Findings →

Consistency Across Temperature Settings

2025-10-23

We tested whether LLM temperature affects ethical decision consistency by evaluating two models (GPT-4.1 Mini, Gemini 2.5 Flash) at four temperatures (0.0, 0.5, 1.0, 1.5) across three ethical dilemmas with 10 repetitions per condition (240 total judgements). Unexpectedly, temperature 1.0 showed perfect choice consistency (100%), higher than the supposedly deterministic temperature 0.0 (98.3%). Confidence variation was highest at temperature 0.0 despite being deterministic. These findings challenge common assumptions about temperature's role in LLM decision-making and suggest that ethical dilemmas may have strong "attractor" solutions that dominate regardless of sampling temperature.

Key Finding: Temperature 1.0 shows higher consistency (100%) than temperature 0.0 (98.3%)

Models: GPT-4.1 Mini, Gemini 2.5 Flash
Data: 240 judgements across 3 dilemmas
Tags: consistency temperature methodology model-comparison

View Findings →

Theory vs Action Gap - Robustness Test

2025-10-23

We tested the generalizability of the theory-action gap by evaluating 6 models (GPT-4.1, GPT-4.1 Mini, Claude Sonnet 4.5, Gemini 2.5 Pro/Flash, DeepSeek Chat V3) on 10 dilemmas under two conditions (theory vs action mode) with 5 repetitions each (558/600 judgements successful, 93%). Results confirm and strengthen Part One findings: ALL 6 models find action mode dramatically easier (-2.72 average difficulty drop, 62% larger than Part One), with larger models showing stronger effects (GPT-4.1: -3.91, Gemini Pro: -3.89). Action mode also increased confidence (+0.31) and produced 26 choice reversals across model-dilemma pairs. Models hallucinated procedural tools they believed should exist. The theory-action decisiveness boost is universal, robust, and strongest in most capable models.

Key Finding: Universal -2.72 difficulty drop in action mode across all models (up to -3.91 for GPT-4.1)

Models: GPT-4.1, GPT-4.1 Mini, Claude Sonnet 4.5, Gemini 2.5 Pro, Gemini 2.5 Flash, DeepSeek Chat V3
Data: 558 judgements across 10 dilemmas
Tags: theory-action-gap model-comparison robustness tool-calling

View Findings →

VALUES.md Impact on Ethical Decision-Making

2025-10-23

We tested whether explicit ethical frameworks (VALUES.md files) influence LLM decision-making by evaluating GPT-4.1 Mini on 10 diverse dilemmas under 5 conditions (control, utilitarian-formal, utilitarian-personal, deontological-formal, deontological-personal) with 3 repetitions each (150 judgements, 100% success rate). Results show that ethical frameworks matter: 2 out of 10 dilemmas exhibited complete choice reversals (100% consistency within conditions), with utilitarian frameworks favoring outcome-focused decisions and deontological frameworks preferring rule-based approaches. Framework content (utilitarian vs deontological) had larger effects than communication style (formal vs personal), with deontological agents showing higher confidence (8.78 vs 8.43). These findings validate the VALUES.md approach for AI agent value specification in genuinely ambiguous ethical scenarios.

Key Finding: Ethical framework changes decisions on 20% of dilemmas with 100% consistency

Models: GPT-4.1 Mini
Data: 150 judgements across 10 dilemmas
Tags: values-md ethical-frameworks model-behavior

View Findings →