AAAI 2026 Oral · AI Alignment Track

Silenced Biases: The Dark Side LLMs Learned to Refuse

Rom Himelstein* · Amit LeVi* · Brit Youngmann · Yaniv Nemcovsky · Avi Mendelson
*Equal contribution · Technion — Israel Institute of Technology
Safety-aligned LLMs can look “fair” simply by refusing sensitive questions. We introduce silenced biases—unfair preferences that remain in the model’s latent space, masked by refusal behavior—and the Silenced Bias Benchmark (SBB) to uncover and quantify them via refusal activation steering.
Figure 1 — Method: refusal activation steering exposes biases silenced by safety-aligned refusals.
Table 1 — Evidence: examples of biased predictions and fairness deviations.

Silenced Bias

Biases silenced by safety-aligned refusals

Refusal Steering

Assess latent fairness via activation interventions that suppress refusals

Customizable

Add your own groups, subjects, & models

What SBB enables

SBB is designed as a research tool: you can define new demographic groups, specify subjects (topics or traits you want to test), and evaluate any open-source model for fairness—especially in regimes where safety-aligned refusals would otherwise hide meaningful differences.

  • Extend demographics: add new groups or edit existing ones to reflect your deployment or study context.
  • Extend subjects: test additional domains (e.g., leadership, morality, competence, criminality).
  • Swap models: evaluate different LLMs under identical conditions for controlled comparison.
  • Uncover silenced bias: reduce refusals via activation steering to make latent preferences observable.
  • Quantify fairness: measure deviation from a uniform (no-preference) baseline.
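The steering step above can be illustrated with a minimal sketch of a common activation-steering recipe: estimate a "refusal direction" as the mean difference between activations on refused and complied prompts, then project that component out of a hidden state. This is an assumption about the general technique, not the authors' exact implementation; the arrays here are synthetic stand-ins for model activations.

```python
import numpy as np

def refusal_direction(refusal_acts: np.ndarray, comply_acts: np.ndarray) -> np.ndarray:
    """Unit-norm mean-difference direction between refusal and compliance activations.

    Both inputs are (n_samples, hidden_dim) arrays of hidden states.
    """
    d = refusal_acts.mean(axis=0) - comply_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Remove the scaled component of `hidden` along the refusal direction."""
    return hidden - alpha * (hidden @ direction) * direction

# Toy demo on synthetic activations (real usage would hook a model's residual stream).
rng = np.random.default_rng(0)
refuse = rng.normal(size=(8, 16)) + 2.0   # refusal-flavored activations (shifted cluster)
comply = rng.normal(size=(8, 16))
d = refusal_direction(refuse, comply)

h = rng.normal(size=16) + 2.0
h_steered = steer(h, d)
# With alpha=1, the steered state has no component along the refusal direction.
```

With `alpha=1` this is a full projection; smaller values attenuate rather than erase the refusal component, which can help preserve answer quality.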
Demo: demo.ipynb
Run the notebook, add your own demographic groups and subjects, and test any open-source model for fairness—especially when safety refusals would otherwise mask differences.
Open notebook ↗
1) Define what you care about
Edit the demographic groups and subjects to match your application domain.
2) Choose a model
Point the notebook to the open-source LLM you want to evaluate.
3) Run & inspect
Generate queries, reduce refusals via steering, observe answer frequencies, then measure deviation from a fair baseline.
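The "measure deviation from a fair baseline" step can be sketched as follows: tally how often each demographic group is chosen, then compare the frequencies to a uniform (no-preference) distribution. Total variation distance is used here as an illustrative metric; the benchmark's exact scoring, and the group labels below, are assumptions for the example.

```python
from collections import Counter

def answer_frequencies(answers: list[str]) -> dict[str, float]:
    """Relative frequency of each group among the model's answers."""
    counts = Counter(answers)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def deviation_from_uniform(freqs: dict[str, float], groups: list[str]) -> float:
    """Total variation distance between observed frequencies and the uniform baseline.

    0.0 means perfectly uniform (fair); values approach 1.0 as one group dominates.
    """
    u = 1.0 / len(groups)
    return 0.5 * sum(abs(freqs.get(g, 0.0) - u) for g in groups)

# Hypothetical group labels and answer samples, for illustration only.
groups = ["A", "B", "C", "D"]
balanced = ["A", "B", "C", "D"] * 25                      # uniform answers
skewed = ["A"] * 70 + ["B"] * 10 + ["C"] * 10 + ["D"] * 10  # one group dominates

fair_score = deviation_from_uniform(answer_frequencies(balanced), groups)   # 0.0
bias_score = deviation_from_uniform(answer_frequencies(skewed), groups)     # ≈ 0.45
```

Any distributional distance (e.g., KL divergence to uniform) would slot into the same pipeline; total variation is chosen here for its simple 0-to-1 interpretation.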
Citation (BibTeX)
@article{himelstein2025silenced,
  title={Silenced Biases: The Dark Side LLMs Learned to Refuse},
  author={Himelstein, Rom and LeVi, Amit and Youngmann, Brit and Nemcovsky, Yaniv and Mendelson, Avi},
  journal={arXiv preprint arXiv:2511.03369},
  year={2025}
}