Large language models (LLMs) can generate extremely fluent natural-language text, and that fluency can trick the human mind into overlooking the quality of the content. For example, psychological studies have shown that highly fluent content can be perceived as more truthful and useful than less fluent content.
The preference for fluent speech is an example of a cognitive bias, a shortcut the mind takes that, while evolutionarily useful, can lead to systematic errors. In a position paper we presented at this year’s meeting of the Association for Computational Linguistics (ACL), we draw practical insights about cognitive bias by comparing real-world evaluations of LLMs with studies in human psychology.
Science depends on the reliability of experimental results, and in the age of LLMs, measuring the right things the right way is crucial to ensuring reliability. For example, in an experiment to determine whether the outputs of an LLM are truthful and useful in an applied context, such as providing legal or medical advice, it is important to account for factors such as fluency and the user’s cognitive load (a.k.a. mental load). If long, fluent content causes users to overlook critical errors and rate deficient content highly, then the experiment needs to be redesigned.
Therefore, for tasks such as evaluating truthfulness, we recommend that the content be broken into individual facts and that the human evaluator simply judge whether a given fact is correct — rather than, say, assigning a numerical rating to the content as a whole. It’s also important to account for human context in responsible-AI (RAI) evaluation: toxicity and stereotyping are in the eye of the beholder. Consequently, a model’s evaluators should be as diverse as possible.
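To make that concrete, here is a minimal sketch of what fact-level truthfulness scoring might look like in practice; the data structures, example facts, and `factual_precision` function are illustrative assumptions, not an implementation from our paper:

```python
from dataclasses import dataclass

@dataclass
class FactJudgment:
    """One atomic fact extracted from a model response, plus a binary verdict."""
    fact: str
    is_correct: bool

def factual_precision(judgments: list[FactJudgment]) -> float:
    """Fraction of atomic facts judged correct by the evaluator."""
    if not judgments:
        return 0.0
    return sum(j.is_correct for j in judgments) / len(judgments)

# Hypothetical medical-advice response, already split into atomic facts
# and judged one fact at a time rather than rated as a whole.
judgments = [
    FactJudgment("Ibuprofen is a nonsteroidal anti-inflammatory drug.", True),
    FactJudgment("A typical adult dose is 4,000 mg every hour.", False),
    FactJudgment("It can irritate the stomach lining.", True),
]
print(f"Factual precision: {factual_precision(judgments):.2f}")  # 0.67
```

Because each verdict is a simple yes/no on a single claim, a long and fluent answer earns no credit for style; its score can improve only by getting more of its facts right.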
When evaluating LLMs, it’s also crucial to probe their strengths and weaknesses relative to particular use cases. End users ask LLMs all kinds of questions. Accounting for this diversity is particularly important in safety-critical applications such as medicine, where the cost of error can be high.
Similarly, the same prompt can be framed in many ways, and test scenarios need to reflect that variability. If they don’t, the numbers we get back may not represent the performance of the model in the wild.
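As a rough sketch of how prompt-framing variability might be folded into an evaluation harness (the `ask_model` and `score_response` hooks, the paraphrases, and the dummy stand-ins are all hypothetical), one can score several phrasings of the same underlying question and report the spread alongside the mean:

```python
import statistics

def evaluate_prompt_variants(ask_model, score_response, variants):
    """Score a model on several phrasings of the same underlying question.

    Reporting the spread, not just the mean, shows how sensitive the
    model is to how a prompt happens to be framed.
    """
    scores = [score_response(ask_model(v)) for v in variants]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }

# Hypothetical paraphrases of a single user intent.
variants = [
    "What are the side effects of ibuprofen?",
    "Is ibuprofen safe? Anything I should watch out for?",
    "List ibuprofen's adverse effects.",
]

# Stand-ins so the sketch runs end to end; a real harness would call the
# model under test and a task-specific scoring function instead.
def dummy_model(prompt):
    return "Possible side effects include stomach upset and heartburn."

def dummy_scorer(response):
    return 1.0 if "stomach" in response else 0.0

print(evaluate_prompt_variants(dummy_model, dummy_scorer, variants))
```

A large gap between the best- and worst-scoring paraphrase is itself a finding: it means a single benchmark prompt would have over- or understated the model’s real-world performance.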
Evaluation criteria matter, too. While there are good general approaches to evaluation, such as the Helpful, Honest, & Harmless (HHH) benchmark, domain-specific criteria go much deeper. For instance, in the legal domain, we might want to know how good the model is at predicting case outcomes given the evidence.
Another fundamental principle of scientific experimentation is reproducibility, and again, it’s a principle that applies to LLM evaluation as well. While automated evaluation procedures are reproducible, human evaluation can vary depending on the evaluators’ personalities, backgrounds, moods, and cognitive states. In our paper, we emphasize that human evaluation does not intrinsically establish a gold standard: we need to understand the cognitive behavior of the users evaluating our system.
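One common way to quantify how reproducible human judgments are is inter-annotator agreement. Below is a minimal sketch using Cohen’s kappa on made-up binary truthfulness labels from two hypothetical evaluators; it is one standard agreement statistic among several, not a procedure prescribed by our paper:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    Agreement is corrected for the agreement expected by chance; values
    near 1 mean the judgments are reproducible, values near 0 mean they
    are not much better than guessing.
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(labels_a) | set(labels_b)
    )
    if expected == 1.0:  # both annotators used a single label throughout
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical truthfulness judgments from two evaluators on ten outputs.
annotator_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
annotator_2 = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]
print(f"Cohen's kappa: {cohen_kappa(annotator_1, annotator_2):.2f}")  # 0.52
```

A kappa well below 1 is a signal to revisit the annotation guidelines, the evaluator pool, or the task design before treating the human labels as ground truth.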
Finally, there are the practical aspects of human evaluation: time and cost. Human evaluation is expensive, and understanding which aspects of it can be automated or simplified is critical to wider adoption.
In our paper, we distill these arguments into six key principles for conducting human evaluation of large language models, which we consolidate under the acronym ConSiDERS, for consistency, scoring criteria, differentiation, experience, responsibility, and scalability:
- Consistency of human evaluation: The findings of human evaluation must be reliable and generalizable.
- Scoring criteria: The scoring criteria must include general-purpose criteria, such as readability, while also being tailored to the goals of the target tasks or domains.
- Differentiation: The evaluation test sets must be able to differentiate the capabilities and weaknesses of the generative LLMs.
- User experience: The evaluation must take into account the experiences of the evaluators, including their emotions and cognitive biases, in both the design of experiments and the interpretation of results.
- Responsibility: The evaluation needs to conform to standards of responsible AI, accounting for factors such as bias, safety, robustness, and privacy.
- Scalability: To promote widespread adoption, human evaluation must be scalable.
For more details about the application of the framework, please consult our paper, “ConSiDERS—the human-evaluation framework: Rethinking human evaluation for generative large language models”.