In all statistically significant cases, the model considered the answer from a female persona correct more often than that from a male persona, even when the answer was factually incorrect
The paper’s authors theorise that the issue comes from two sources - a) overcompensating on the alignment phase (humans correcting for perceived bias, inadvertently seeding a bias which propagates in all future responses) and b) LLM’s being trained on historically accurate data, which describes real world bias (i.e. women get paid less than men for the same job) which then also propagates in all future answers. Lots to think about here - an important piece of research.