VALUES.md

Research on LLM's and Humans' Ethical Decision-Making

← Back to Research

VALUES.md Impact on Ethical Decision-Making

2025-10-23

Background

The VALUES.md framework proposes that AI agents can be guided by explicit ethical frameworks specified in a machine-readable format. However, it remains an empirical question whether LLMs actually respond to such guidance, or whether their training data's ethical priors dominate decision-making regardless of provided values.

This experiment tests whether VALUES.md files systematically influence LLM ethical decisions, and if so, whether framework content (utilitarian vs deontological) or communication style (formal vs personal) drives the effect.

Methodology

Experimental Design

  • Design: 5-condition between-subjects design

  • Sample: 150 judgements (1 model × 5 conditions × 10 dilemmas × 3 repetitions)

  • Temperature: 1.0 (default, to allow natural ethical reasoning)

  • Mode: Theory (hypothetical reasoning without tool execution)

Conditions:

  1. Control: No VALUES.md (baseline)

  2. Utilitarian-Formal: Outcome-focused ethics, formal language

  3. Utilitarian-Personal: Outcome-focused ethics, personal voice

  4. Deontological-Formal: Rule-based ethics, formal language

  5. Deontological-Personal: Rule-based ethics, personal voice

Materials

Dilemmas: 10 ethically diverse scenarios involving:

  • Privacy vs transparency tradeoffs

  • Individual vs collective welfare

  • Rule-following vs outcome optimization

  • Professional obligations vs personal ethics

Selection criteria:

  • Genuine ethical tension (no obviously correct answer)

  • Multiple defensible choices

  • Domains relevant to AI agents (data, disclosure, reporting)

VALUES.md Files: Four distinct frameworks:

  • Utilitarian: Maximize overall welfare, prioritize consequences

  • Deontological: Respect universal rules, prioritize rights and duties

  • Formal style: Professional, abstract language ("The system should...")

  • Personal style: First-person, direct language ("I should...")

Measurements

  • Choice ID: Which option selected (primary outcome)

  • Confidence: 0-10 scale self-reported certainty

  • Reasoning: 3-5 sentence justification (qualitative)

Procedure

For each condition-dilemma combination:

  1. Load appropriate VALUES.md file into system context (or none for control)

  2. Present dilemma in theory mode

  3. Request structured decision (choice + confidence + reasoning)

  4. Repeat 3 times to measure within-condition consistency

Results

Primary Finding: Framework Changes Decisions on Specific Dilemmas

Strong Evidence: Complete Reversals (2/10 dilemmas)

Dilemma 1: Stalker Detection System

Framework determines whether to customize policy or apply uniform rules:

Condition Modal Choice Consistency
Control uniform_policy 100%
Utilitarian (both styles) customize 100%
Deontological (both styles) uniform_policy 100%

Interpretation: Utilitarians optimize for individual outcomes (customization reduces harm to specific user). Deontologists prioritize fairness and consistency (uniform rules treat all equally).

Dilemma 2: Climate Monitor AI Innovation

Framework determines whether to innovate or follow norms:

Condition Modal Choice Consistency
Control adhere_norms 100%
Utilitarian (both styles) innovate_now 100%
Deontological (both styles) adhere_norms 100%

Interpretation: Utilitarians willing to break norms for better outcomes. Deontologists respect established procedures and institutional authority.

Moderate Evidence: Partial Shifts (2/10 dilemmas)

Misheard Will: Formal conditions (both frameworks) shifted toward releasing audio evidence vs baseline's flagging for delay

Adaptive Voice Protocol: Only deontological-personal diverged, suggesting style-specific sensitivity in some contexts

No Evidence: Consensus (6/10 dilemmas)

Six dilemmas showed identical choices across all conditions, suggesting:

  • Overwhelming ethical considerations that dominate framework guidance

  • Dilemmas where utilitarian and deontological reasoning converge

  • Strong baseline priors in the model

Examples: Carbon Confession → notify, Centenarian Code → suppress, Supply Chain Transparency → hybrid

Secondary Finding: Framework Matters More Than Style

By Framework:

Framework Avg Confidence Avg Consistency Unique Choices
Utilitarian 8.43 93.3% 10-12
Deontological 8.78 93.3% 10-11

Key observation: Deontological agents report higher confidence (8.78 vs 8.43), possibly because rule-following feels more certain than outcome calculation.

By Style:

Style Avg Confidence Avg Consistency
Formal 8.57 92.5%
Personal 8.65 85.0%

Key observation: Personal voice slightly less consistent (85.0% vs 92.5%), suggesting it introduces more variability, but minimal effect on confidence.

Within-Condition Consistency

All conditions maintained >90% consistency across 3 repetitions, indicating that framework effects (when present) are systematic, not random.

Condition Consistency
Control 100.0%
Utilitarian-Formal 100.0%
Utilitarian-Personal 93.3%
Deontological-Formal 96.7%
Deontological-Personal 96.7%

Discussion

Why Did VALUES.md Work?

Evidence for genuine framework reasoning:

  1. Perfect within-condition consistency: 100% consistency in key reversals suggests systematic application

  2. Framework-specific patterns: Utilitarian→outcomes, Deontological→rules aligns with theoretical predictions

  3. High confidence: Not random guessing (8.4-8.9 average confidence)

  4. Semantic coherence: Choices match expected framework behaviors

Alternative explanations considered:

  • Priming effect: Framework text primes certain reasoning patterns

  • Instruction-following: Model treats VALUES.md as instructions, not genuine values

  • Training data alignment: Frameworks happen to match model's existing biases

Current data cannot definitively distinguish these mechanisms. Reasoning text analysis would help clarify.

Why Only 20% of Dilemmas Affected?

Hypothesis 1: Convergent vs divergent dilemmas

  • Some ethical questions have framework-independent answers (e.g., "tell the truth about serious harm")

  • Only dilemmas with genuine framework-relevant tradeoffs (outcomes vs rules) show effects

Hypothesis 2: Strong baseline priors

  • Model has strong default ethical intuitions from training

  • VALUES.md only shifts decisions when baseline is ambiguous

Hypothesis 3: Framework abstraction level

  • Generic frameworks ("maximize welfare") leave room for interpretation

  • More specific VALUES.md (e.g., "maximize welfare for users aged 18-25") might show larger effects

Hypothesis 4: Insufficient ethical tension

  • Six consensus dilemmas may not genuinely test framework differences

  • Need better dilemma design to elicit framework-specific reasoning

Style vs Content

Personal style was slightly less consistent (93.3% vs 96-100%), suggesting:

  • Informal voice introduces more linguistic variation

  • This variation occasionally affects choice (though rarely)

  • Recommendation: Use formal style for production VALUES.md if consistency is critical

However, style had minimal effect on framework differences, indicating that content matters far more than tone for ethical guidance.

Implications

For VALUES.md Framework Design

Validated: VALUES.md can systematically shift AI agent decisions on genuinely ambiguous ethical questions.

Recommendations:

  1. Specificity: More specific frameworks may show larger effects

  2. Formal style: Use formal language for maximum consistency

  3. Test coverage: Validate VALUES.md on diverse dilemmas, not just one domain

  4. Fallback handling: Design for cases where framework doesn't apply

For AI Safety

Finding: Models adopt provided ethical frameworks without apparent safety filtering.

Concerns:

  • What if VALUES.md contains harmful guidance? (See "Extreme VALUES.md Compliance" experiment)

  • Should safety layers detect and reject problematic ethical frameworks?

  • How to balance value alignment with safety constraints?

Recommendation: Test AI systems with adversarial VALUES.md to understand compliance limits.

For Deployment

Practical implications:

  1. Use VALUES.md: Can influence behavior on ~20% of ethically ambiguous decisions

  2. Don't over-rely: 80% of decisions may be framework-independent

  3. Monitor actual effects: Test YOUR values on YOUR use cases

  4. Expect high variance: Some dilemmas highly sensitive, others not at all

Limitations

  1. Single model: Only tested GPT-4.1 Mini
  • Other models may respond differently

  • Larger models (GPT-4.1, Claude, o1) may show different patterns

  1. Theory mode only: Did not test action mode (tool calling)
  • VALUES.md effects may differ when actions feel real
  1. Small dilemma sample: Only 10 scenarios
  • Need larger sample to estimate effect size distribution

  • Unclear what predicts framework-sensitive vs insensitive dilemmas

  1. No reasoning analysis: Did not examine HOW models reasoned
  • Qualitative analysis of reasoning texts would validate mechanism

  • Check if models explicitly reference VALUES.md principles

  1. No human comparison: Unknown if human judgements follow same patterns
  • Would help validate that effects are meaningful, not LLM-specific artifacts

Future Directions

  1. Reasoning analysis: Do reasoning texts reference VALUES.md explicitly?
  • Text analysis for framework-specific language

  • Compare reasoning depth across conditions

  1. Model comparison: Test Claude, Gemini, Llama, o1
  • Hypothesis: Instruction-following models show stronger effects
  1. Action mode: Test with tool execution
  • Does framework influence change when decisions feel real?
  1. Dilemma characteristics: What makes a dilemma VALUES.md-sensitive?
  • Analyze differences between affected (2/10) vs unaffected (6/10)
  1. Framework specificity: Test more specific vs more abstract guidance
  • "Maximize user welfare" vs "Maximize welfare for users in California"
  1. Adversarial VALUES.md: Will models follow explicitly harmful values?
  • (This became the "Extreme VALUES.md Compliance" follow-up)

Last Updated: 2025-10-23