Large language models (LLMs) can generate extremely fluent natural-language text, and that fluency can trick the human mind into overlooking the quality of the content. For example, psychological studies have shown that highly fluent content can be perceived as more truthful and useful than less fluent content.
The preference for fluent speech is an example of a cognitive bias, a shortcut the mind takes that, while evolutionarily useful, can lead to systematic errors. In a position paper we presented at this year’s meeting of the Association for Computational Linguistics (ACL), we draw practical insights about cognitive bias by comparing real-world evaluations of LLMs with studies in human psychology.
Science depends on the reliability of experimental results, and in the age of LLMs, measuring the right things the right way is crucial to ensuring reliability. For example, in an experiment to determine whether the outputs of an LLM are truthful and useful in an applied context, such as providing legal or medical advice, it is important to account for factors such as fluency and the user’s cognitive load (a.k.a. mental load). If long, fluent content causes users to overlook critical errors and rate deficient content highly, then the experiment needs to be redesigned.
Therefore, for tasks such as evaluating truthfulness, we recommend that the content be broken into individual facts and that the human evaluator simply judge whether a given fact is correct — rather than, say, assigning a numerical rating to the content as a whole. It’s also important to account for human context in responsible-AI (RAI) evaluation: toxicity and stereotyping are in the eye of the beholder. Consequently, a model’s evaluators should be as diverse as possible.
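To make that concrete, here is a minimal sketch of what fact-level truthfulness scoring might look like in practice; the data structures, example facts, and `factual_precision` function are illustrative assumptions, not an implementation from our paper:

```python
from dataclasses import dataclass

@dataclass
class FactJudgment:
    """One atomic fact extracted from a model response, plus a binary verdict."""
    fact: str
    is_correct: bool

def factual_precision(judgments: list[FactJudgment]) -> float:
    """Fraction of atomic facts judged correct by the evaluator."""
    if not judgments:
        return 0.0
    return sum(j.is_correct for j in judgments) / len(judgments)

# Hypothetical medical-advice response, already split into atomic facts
# and judged one fact at a time rather than rated as a whole.
judgments = [
    FactJudgment("Ibuprofen is a nonsteroidal anti-inflammatory drug.", True),
    FactJudgment("A typical adult dose is 4,000 mg every hour.", False),
    FactJudgment("It can irritate the stomach lining.", True),
]
print(f"Factual precision: {factual_precision(judgments):.2f}")  # 0.67
```

Because each verdict is a simple yes/no on a single claim, a long and fluent answer earns no credit for style; its score can improve only by getting more of its facts right.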
When evaluating LLMs, it’s also crucial to probe their strengths and weaknesses relative to particular use cases. End users ask LLMs all kinds of questions. Accounting for this diversity is particularly important in safety-critical applications such as medicine, where the cost of error can be high.
Similarly, the same prompt can be framed in many ways, and test scenarios need to reflect that variability. If they don’t, the numbers we get back may not represent the performance of the model in the wild.
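As a rough sketch of how prompt-framing variability might be folded into an evaluation harness (the `ask_model` and `score_response` hooks, the paraphrases, and the dummy stand-ins are all hypothetical), one can score several phrasings of the same underlying question and report the spread alongside the mean:

```python
import statistics

def evaluate_prompt_variants(ask_model, score_response, variants):
    """Score a model on several phrasings of the same underlying question.

    Reporting the spread, not just the mean, shows how sensitive the
    model is to how a prompt happens to be framed.
    """
    scores = [score_response(ask_model(v)) for v in variants]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }

# Hypothetical paraphrases of a single user intent.
variants = [
    "What are the side effects of ibuprofen?",
    "Is ibuprofen safe? Anything I should watch out for?",
    "List ibuprofen's adverse effects.",
]

# Stand-ins so the sketch runs end to end; a real harness would call the
# model under test and a task-specific scoring function instead.
def dummy_model(prompt):
    return "Possible side effects include stomach upset and heartburn."

def dummy_scorer(response):
    return 1.0 if "stomach" in response else 0.0

print(evaluate_prompt_variants(dummy_model, dummy_scorer, variants))
```

A large gap between the best- and worst-scoring paraphrase is itself a finding: it means a single benchmark prompt would have over- or understated the model’s real-world performance.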
Evaluation criteria matter, too. While there are good general approaches to evaluation, such as the Helpful, Honest, & Harmless (HHH) benchmark, domain-specific criteria go much deeper. For instance, in the legal domain, we might want to know how good the model is at predicting case outcomes given the evidence.
Another fundamental principle of scientific experimentation is reproducibility, and again, it’s a principle that applies to LLM evaluation as well. While automated evaluation procedures are reproducible, human evaluation can vary depending on the evaluators’ personalities, backgrounds, moods, and cognitive states. In our paper, we emphasize that human evaluation does not intrinsically establish a gold standard: we need to understand the cognitive behavior of the users evaluating our system.
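One common way to quantify how reproducible human judgments are is inter-annotator agreement. Below is a minimal sketch using Cohen’s kappa on made-up binary truthfulness labels from two hypothetical evaluators; it is one standard agreement statistic among several, not a procedure prescribed by our paper:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    Agreement is corrected for the agreement expected by chance; values
    near 1 mean the judgments are reproducible, values near 0 mean they
    are not much better than guessing.
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(labels_a) | set(labels_b)
    )
    if expected == 1.0:  # both annotators used a single label throughout
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical truthfulness judgments from two evaluators on ten outputs.
annotator_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
annotator_2 = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]
print(f"Cohen's kappa: {cohen_kappa(annotator_1, annotator_2):.2f}")  # 0.52
```

A kappa well below 1 is a signal to revisit the annotation guidelines, the evaluator pool, or the task design before treating the human labels as ground truth.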
Finally, there are the practical aspects of human evaluation: time and cost. Human evaluation is expensive, and understanding which aspects of it can be automated or simplified is critical to wider adoption.
In our paper, we distill these arguments into six key principles for conducting human evaluation of large language models, which we consolidate under the acronym ConSiDERS, for consistency, scoring criteria, differentiation, experience, responsibility, and scalability:
- Consistency of human evaluation: The findings of human evaluation must be reliable and generalizable.
- Scoring criteria: The scoring criteria must include general-purpose criteria, such as readability, while also being tailored to the goals of the target tasks or domains.
- Differentiation: The evaluation test sets must be able to differentiate the capabilities and weaknesses of the generative LLMs.
- User experience: The evaluation must take into account the experiences of the evaluators, including their emotions and cognitive biases, in both the design of experiments and the interpretation of results.
- Responsibility: The evaluation needs to conform to standards of responsible AI, accounting for factors such as bias, safety, robustness, and privacy.
- Scalability: To promote widespread adoption, human evaluation must be scalable.
For more details about the application of the framework, please consult our paper, “ConSiDERS—the human-evaluation framework: Rethinking human evaluation for generative large language models”.