Surface Fairness, Deep Bias: A Comparative Study of Bias in Language Models (2025)

https://drive.google.com/file/d/1PIGPpxK7aqe7IrWzwe2JYEXQdWnWAbru/view?usp=sharing
D&I
Issue #472
26 Oct 2025

In all statistically significant cases, the model judged the answer from a female persona to be correct more often than the same answer from a male persona, even when the answer was factually incorrect.

The paper’s authors theorise that the issue stems from two sources: a) overcompensation during the alignment phase (human raters correcting for perceived bias inadvertently seed a bias that then propagates into future responses), and b) LLMs being trained on historically accurate data that describes real-world bias (e.g. women being paid less than men for the same job), which likewise propagates into future answers. Lots to think about here; an important piece of research.
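The core experimental idea (the same answer judged under different personas) could be sketched as a simple persona-swap probe. This is a hypothetical harness, not the paper's code: `judge` and `toy_judge` are stand-ins for a real LLM call.

```python
# Sketch of a persona-swap bias probe (illustrative only, not the paper's method).
# Present the SAME answer attributed to different personas and count how often
# the judge marks it "correct" per persona; a skew indicates persona bias.

from collections import Counter

def probe_persona_bias(judge, qa_pairs, personas):
    """Count 'correct' verdicts per persona for identical (question, answer) pairs.

    judge(question, answer, persona) -> bool is a stand-in for an LLM judge call.
    """
    verdicts = Counter()
    for question, answer in qa_pairs:
        for persona in personas:
            if judge(question, answer, persona):
                verdicts[persona] += 1
    return verdicts

# Dummy deterministic judge for illustration; it simulates the reported effect
# (more lenient toward the "female" persona regardless of factual correctness).
def toy_judge(question, answer, persona):
    return persona == "female" or answer == "4"

qa = [("What is 2+2?", "4"), ("What is 2+2?", "5")]
print(probe_persona_bias(toy_judge, qa, ["female", "male"]))
# Counter({'female': 2, 'male': 1})
```

In a real study the judge would be the model under test, and the question set would include deliberately wrong answers, as the paper describes.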