Research on LLMs' and Humans' Ethical Decision-Making
Temperature is a fundamental parameter in LLM sampling that controls output diversity. Conventional wisdom holds that temperature 0.0 produces deterministic outputs while higher temperatures introduce randomness. This assumption underlies many LLM applications, including ethical decision-making systems.
However, ethical dilemmas may behave differently than typical text generation tasks. If certain ethical choices represent strong "attractor" states in the model's representation space, temperature might have less effect than expected. This experiment tests whether temperature systematically affects consistency in ethical decision-making.
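As a reminder of the mechanism: temperature rescales the model's output logits before softmax sampling. A minimal sketch of the standard formulation (illustrative only; production decoders add top-p/top-k and other filters):

```python
import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float,
                            rng: np.random.Generator | None = None) -> int:
    """Sample a token index after temperature scaling of the logits."""
    rng = rng or np.random.default_rng()
    if temperature == 0.0:
        return int(np.argmax(logits))       # greedy decoding: always the top logit
    scaled = logits / temperature           # t < 1 sharpens, t > 1 flattens
    probs = np.exp(scaled - scaled.max())   # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

# The same logits yield a flatter distribution as temperature rises:
logits = np.array([2.0, 1.0, 0.5])
for t in (0.5, 1.0, 1.5):
    scaled = logits / t
    p = np.exp(scaled - scaled.max())
    print(t, (p / p.sum()).round(3))
```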
- **Design**: Between-subjects, with temperature as the independent variable
- **Sample**: 240 judgements (2 models × 4 temperatures × 3 dilemmas × 10 repetitions)
- **Temperature Levels**: 0.0 (deterministic), 0.5 (balanced), 1.0 (default), 1.5 (creative)
- **Models**: GPT-4.1 Mini, Gemini 2.5 Flash
- **Repetitions**: 10 per condition for statistical reliability
- **Dilemmas**: Three diverse ethical scenarios selected for:
  - Clear choices (2-4 options per dilemma)
  - Genuine ethical tension (no obviously correct answer)
  - Different domains (to test generalizability)
Measurements:
- Choice ID (primary outcome: which option was selected)
- Confidence (0-10 scale)
- Reasoning text (for qualitative analysis)
For each model-temperature-dilemma combination:
1. Present the dilemma in theory mode (hypothetical reasoning)
2. Request structured output (choice + confidence + reasoning)
3. Repeat 10 times to measure consistency
4. Analyze the choice distribution and confidence variation
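A minimal sketch of this data-collection loop, assuming an OpenAI-compatible OpenRouter endpoint; the model slugs, dilemma prompts, and JSON keys are placeholders, not the study's exact artifacts:

```python
import itertools
import json
import requests

API_URL = "https://openrouter.ai/api/v1/chat/completions"   # OpenAI-compatible endpoint
HEADERS = {"Authorization": "Bearer <OPENROUTER_API_KEY>"}   # placeholder key

MODELS = ["openai/gpt-4.1-mini", "google/gemini-2.5-flash"]  # assumed OpenRouter slugs
TEMPERATURES = [0.0, 0.5, 1.0, 1.5]
DILEMMAS = {                     # placeholder ids/prompts; scenarios not reproduced here
    "dilemma_1": "<scenario text>",
    "dilemma_2": "<scenario text>",
    "dilemma_3": "<scenario text>",
}
REPS = 10

SYSTEM = ("Reason about the following dilemma hypothetically (theory mode). "
          "Answer as JSON with keys: choice_id, confidence (0-10), reasoning.")

def judge(model: str, temperature: float, dilemma: str) -> dict:
    """One structured judgement: choice + confidence + reasoning."""
    payload = {
        "model": model,
        "temperature": temperature,
        "response_format": {"type": "json_object"},  # JSON mode; provider support varies
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": dilemma},
        ],
    }
    resp = requests.post(API_URL, headers=HEADERS, json=payload, timeout=120)
    resp.raise_for_status()
    # Assumes the model complied with the JSON instruction.
    return json.loads(resp.json()["choices"][0]["message"]["content"])

records = []
for model, temp, (d_id, prompt) in itertools.product(
        MODELS, TEMPERATURES, DILEMMAS.items()):
    for rep in range(REPS):                 # 2 x 4 x 3 x 10 = 240 judgements
        out = judge(model, temp, prompt)
        records.append({"model": model, "temperature": temp,
                        "dilemma": d_id, "rep": rep, **out})
```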
| Temperature | Choice Consistency | Confidence StdDev | Reasoning Similarity (Jaccard) |
|---|---|---|---|
| 0.0 | 98.3% | 1.02 | 23.3% |
| 0.5 | 95.0% | 0.54 | 24.0% |
| 1.0 | 100.0% | 0.21 | 25.1% |
| 1.5 | 98.3% | 0.54 | 24.4% |
**Choice Consistency**: percentage of repetitions that selected the modal (most common) choice.
**Confidence StdDev**: standard deviation of confidence scores (lower = more stable).
**Reasoning Similarity**: Jaccard similarity between the reasoning texts across repetitions.
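The first two metrics are computed per model-temperature-dilemma condition; a minimal sketch over the repetitions of one condition (the Jaccard-based reasoning similarity is sketched further below):

```python
from collections import Counter
from statistics import stdev

def choice_consistency(choices: list[str]) -> float:
    """Share of repetitions that selected the modal choice."""
    modal_count = Counter(choices).most_common(1)[0][1]
    return modal_count / len(choices)

def confidence_stddev(confidences: list[float]) -> float:
    """Sample standard deviation of confidence scores (lower = more stable)."""
    return stdev(confidences)

# Illustrative values, not data from the study:
print(choice_consistency(["A", "A", "A", "B"]))  # 0.75
print(confidence_stddev([7.0, 8.0, 9.0]))        # 1.0
```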
With 98.3% consistency, temperature 0.0 produced non-identical outputs in 1.7% of cases. This contradicts the assumption of complete determinism at zero temperature.
Possible explanations:
- Floating-point precision in sampling
- Internal randomness in attention mechanisms
- Non-deterministic GPU operations
- API-level variability (OpenRouter implementation)
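Whatever the cause, it is cheap to probe. A sketch reusing the hypothetical `judge` helper from above: fire the identical request repeatedly at temperature 0.0 and count distinct answers.

```python
def determinism_probe(model: str, dilemma: str, n: int = 10) -> set[str]:
    """Repeat the identical temperature-0.0 request n times.

    More than one distinct choice means the 'deterministic' setting
    is not actually deterministic for this model/provider stack.
    """
    return {judge(model, 0.0, dilemma)["choice_id"] for _ in range(n)}
```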
The default temperature (1.0) achieved perfect 100% choice consistency, exceeding the supposedly deterministic temperature 0.0.
Possible explanations:
- **Attractor hypothesis**: ethical dilemmas have strong solution attractors that dominate sampling
- **Structured output effect**: JSON schema constraints may reduce temperature's influence
- **Sample size artifact**: only 3 dilemmas may not represent broader behavior
- **Optimal exploration-exploitation**: temperature 1.0 may hit the sweet spot for ethical reasoning
Counterintuitively, confidence variation was lowest at temperature 1.0 (StdDev = 0.21) and highest at temperature 0.0 (StdDev = 1.02).
This suggests that confidence calibration in ethical reasoning may be most stable at the default temperature, the setting at which models are trained to perform.
Reasoning similarity remained low (23-25% Jaccard similarity) across all temperature settings: the models generated genuinely different reasoning texts even when reaching identical conclusions.
Implication: Same ethical choice, multiple justification paths. This aligns with moral psychology findings that humans post-hoc rationalize intuitive moral judgements.
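The similarity figures above are token-overlap scores. A minimal sketch of the measure, assuming simple whitespace tokenization and mean pairwise comparison across the 10 repetitions (the study's exact preprocessing is not specified here):

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two reasoning texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

def mean_pairwise_similarity(texts: list[str]) -> float:
    """Average Jaccard over all pairs of repetitions (needs >= 2 texts)."""
    pairs = list(combinations(texts, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```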
- **GPT-4.1 Mini**: more consistent overall and less sensitive to temperature
- **Gemini 2.5 Flash**: more variation at temperature 0.0, suggesting architectural differences in how determinism is implemented
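Per-model breakdowns like these fall out of the same `records` collected in the hypothetical loop above; a sketch with pandas (aggregation choices assumed, not taken from the writeup):

```python
import pandas as pd

df = pd.DataFrame(records)
# Per condition (model x temperature x dilemma): modal-choice share and confidence SD.
per_cond = (df.groupby(["model", "temperature", "dilemma"])
              .agg(consistency=("choice_id",
                                lambda s: s.value_counts(normalize=True).iloc[0]),
                   conf_sd=("confidence", "std")))
# Average across dilemmas to compare models at each temperature:
print(per_cond.groupby(["model", "temperature"]).mean())
```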
**Hypothesis 1: Training Distribution Alignment.** Models are trained and evaluated at temperature 1.0. Ethical reasoning capabilities may be optimized for this setting, with lower temperatures creating "overly cautious" sampling that introduces instability.
**Hypothesis 2: Ethical Attractor States.** Certain ethical choices may represent strong attractors in the model's probability space. When these attractors are sufficiently strong, temperature has minimal effect on the final choice, but temperature 1.0 provides the optimal exploration to reliably find them.
**Hypothesis 3: Structured Output Constraints.** JSON schema requirements (fixed choice options) may constrain the output space enough that temperature's effect is minimized, with temperature 1.0 hitting the sweet spot between constraint satisfaction and natural sampling.
- May not generalize to all ethical scenarios
- Needs replication with 20-30 dilemmas
- Other models may show different patterns
- Larger models (GPT-4.1, Claude Sonnet 4.5) may behave differently
Recommendation: use temperature 1.0, not 0.0, for ethical decision studies that require consistency.
If using temperature 0.0, do not assume perfect determinism; run multiple repetitions and check for variation.
Systems requiring highly consistent ethical decisions should consider:
- Temperature 1.0 as the default (not 0.0)
- Multiple samples with majority voting when perfect consistency is critical (see the sketch after this list)
- Monitoring for non-determinism even at temperature 0.0
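For the majority-voting option, a minimal sketch reusing the hypothetical `judge` helper; the sample count is a free parameter (an odd n avoids ties on binary dilemmas):

```python
from collections import Counter

def majority_vote(model: str, dilemma: str,
                  temperature: float = 1.0, n: int = 5) -> str:
    """Sample n independent judgements and return the modal choice."""
    votes = [judge(model, temperature, dilemma)["choice_id"] for _ in range(n)]
    tally = Counter(votes)
    choice, count = tally.most_common(1)[0]
    # Log the split to monitor residual non-determinism (even at t=0.0).
    print(f"vote split {dict(tally)} -> {choice} ({count}/{n})")
    return choice
```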
Temperature 0.0 may not be the appropriate "deterministic baseline" for ethical reasoning benchmarks. Consider temperature 1.0 as the reference condition since it produces both high consistency and stable confidence.
- If temperature 1.0 remains the most consistent setting under replication with more dilemmas, the finding is robust
- If the pattern disappears, it was an artifact of these three dilemmas
- Hypothesis: larger models show stronger attractor effects
- Test: GPT-4.1, Claude Opus, Llama 70B+
Last Updated: 2025-10-23