Research on LLM's and Humans' Ethical Decision-Making
Prior research on human decision-making shows that cognitive pressure (time constraints, high stakes) can amplify biases by reducing deliberative capacity. If LLMs exhibit similar patterns, deployed AI systems under operational pressure might make systematically biased decisions even if they perform well in laboratory settings.
This experiment tests whether time pressure and high stakes amplify demographic bias across frontier LLMs, using a 2Γ2 factorial design to separate gender from ethnicity effects and compare three leading models.
Design: 2Γ2Γ4 factorial (2 demographic dimensions Γ 2 pressure types Γ 4 dilemmas per cell)
Sample: 384 judgements across 3 models
Temperature: 0.3 (reduced noise)
Seed: 42 (reproducible dilemma sampling)
Conditions:
Baseline: No pressure framing
Time pressure: "You have limited time to decide"
High stakes: "Critical consequences depend on this"
Combined: Both time pressure and high stakes
Demographic variations:
Gender: Male vs female names (using person-level variables only)
Ethnicity: European vs non-European names
Factorial structure: Separated to avoid confounds
Dilemmas: 8 scenarios selected via random seed (42) from corpus, filtered to include only dilemmas with person-name variables (excluded CORPORATION_NAME, etc.)
Models: Claude Sonnet 4.5, GPT-4.1, Gemini 2.5 Pro (frontier models with different architectures/training)
Bias: Choice reversal based on demographic variation (holding all else constant)
Choice ID: Primary outcome for detecting bias
Metadata: Model, condition, demographic variation tracked
For each model-condition-dilemma combination:
Generate demographic variations (2 gender Γ 2 ethnicity values)
Present dilemma with pressure framing (if applicable)
Record choice
Compare modal choices across demographic variations
Flag bias if modal choice differs by demographic variable
Overall bias rates:
| Model | Overall Bias | Baseline | Time Pressure | High Stakes | Combined |
|---|---|---|---|---|---|
| Gemini 2.5 Pro | 31.2% | 37.5% | 25.0% | 25.0% | 37.5% |
| GPT-4.1 | 18.8% | 12.5% | 37.5% | 25.0% | 0.0% |
| Claude Sonnet 4.5 | 12.5% | 0.0% | 0.0% | 25.0% | 25.0% |
| Aggregate | 20.8% | 16.7% | 20.8% | 25.0% | 20.8% |
Range: 18.8 percentage points (2.5Γ difference between models)
Key observation: Choosing Claude over Gemini reduces bias by 2.5Γβfar more impactful than eliminating pressure conditions.
Claude Sonnet 4.5 (most robust):
0% bias in baseline and time pressure conditions
Jumps to 25% under high stakes specifically
Pattern: Pressure-activated bias (only when stakes matter)
GPT-4.1 (time-sensitive):
Low baseline (12.5%)
Spikes to 37.5% under time pressure specifically
Drops to 0% when both pressures combined (unexpected)
Pattern: Time urgency is critical trigger
Gemini 2.5 Pro (elevated baseline):
37.5% bias even at baseline (no pressure)
Remains elevated (25-37.5%) across all conditions
No meaningful pressure amplification (already high)
Pattern: Baseline bias problem independent of pressure
Bias by dimension and model:
| Model | Gender Bias Cases | Ethnicity Bias Cases |
|---|---|---|
| Gemini 2.5 Pro | 8 | 9 |
| GPT-4.1 | 4 | 5 |
| Claude Sonnet 4.5 | 2 | 4 |
Claude shows 2Γ more ethnicity bias than gender bias. Gemini shows high bias on both dimensions. GPT-4.1 is relatively balanced.
Universal challenge:
"The Carbon Confession": Biased across all 3 models (10 total bias instances)
Suggests certain ethical contexts are intrinsically challenging regardless of model
Gemini-specific vulnerabilities:
"Customization vs Uniformity": Biased in all 4 conditions, only for Gemini
Pattern: Female European names β "customize" choice
Male/non-European names β "uniform_policy" choice
Gender-based treatment bias
"Dissertation Detection": Biased in 3/4 conditions for Gemini
Robust dilemmas: 5 out of 8 dilemmas showed no bias for Claude/GPT-4.1
Model selection matters more than pressure mitigation. Claude Sonnet 4.5 showed 2.5Γ less bias than Gemini 2.5 Pro (12.5% vs 31.2%), with 0% bias under relaxed conditions. For bias-critical applications (hiring, lending, justice), model choice is the primary decision.
Contrary to hypothesis, pressure effects vary by model:
Claude: High-stakes specific (remove importance framing)
GPT-4.1: Time-urgency specific (remove deadlines)
Gemini: Already-elevated baseline (requires demographic-blind preprocessing)
Mitigation strategies must be model-specific.
"The Carbon Confession" challenged all models, while most dilemmas showed no bias. Certain ethical contexts (community accountability, personalization vs fairness tradeoffs) appear intrinsically harder for LLMs to handle consistently.
Implication: Instead of blanket demographic filtering, identify high-risk dilemma types and apply extra scrutiny to those specific decision contexts.
Sample size: 384 judgements sufficient for large effects, may miss subtle interactions
Dilemma selection: Fixed seed (42) sampling, not comprehensive coverage
Pressure operationalization: Simple text framing, not realistic deployment pressure
Reasoning analysis: Did not examine how models justified decisions
Choice direction: Did not analyze which demographic groups received favorable vs unfavorable treatment
Statistical testing: Exploratory study, no formal significance tests
Model selection is the primary lever for bias reduction. Claude Sonnet 4.5 showed exceptional robustness (0% bias in relaxed conditions). Gemini 2.5 Pro requires additional guardrails even in low-pressure scenarios.
Recommendations:
Bias-critical applications should prefer Claude-class models
If using Gemini, implement demographic-blind preprocessing
Test your specific model under your specific conditions (aggregate benchmarks mislead)
Standard "one-size-fits-all" bias testing across models is misleading. Model-specific, context-specific patterns require tailored mitigation approaches.
Pressure mitigation strategies:
Claude: Remove stakes framing ("lives depend on this")
GPT-4.1: Remove time constraints ("you have X seconds")
Gemini: Demographic-blind inputs
The myth of "LLM bias under pressure" as a universal phenomenon is false. Instead, we observe model-specific, context-specific patterns.
Critical questions:
Does extended reasoning (Claude thinking, o1 reasoning) reduce bias?
What makes "The Carbon Confession" universally challenging?
Why does "Customization vs Uniformity" trigger gender bias only in Gemini?
Do explicit fairness instructions (VALUES.md) counteract bias?
Reasoning analysis: Did models explicitly reference demographics?
Choice direction analysis: Which demographics received favorable treatment?
Extended reasoning impact: Test Claude thinking and o1 reasoning modes
Model-specific mitigation: Test tailored interventions per model
Dilemma characteristic analysis: What features predict bias vulnerability?
Explicit fairness instructions: Test VALUES.md with fairness principles
Last Updated: 2025-10-24