AI Tools in Journalism Demand Specific Performance Standards
In the rapidly evolving world of AI, a growing concern among researchers and industry experts is how the performance of AI tools is evaluated, particularly large language models (LLMs).
Recent studies have highlighted the limitations of these evaluation methods. For instance, AI researcher Nick McGreivy compared the practice of letting AI developers decide how useful their own tools are to pharmaceutical companies deciding whether their drugs should go to market. The analogy underscores the potential for bias and misrepresentation of these models' capabilities.
That misrepresentation is compounded by the competitive pressure on companies to demonstrate constant progress and to optimize their models for benchmark leaderboards. The focus on high scores often produces models trained to be good test-takers rather than models that are genuinely more accurate.
In response to these concerns, there is a call for a fundamental rethinking of how LLMs are evaluated. Researchers advocate smaller, task-based evaluations grounded in social-science methods, emphasizing adaptability, transparency, and practicality, and propose concentrating on the 'highest-risk deployment contexts,' such as applications in medicine, law, education, and finance.
One area where this change is particularly relevant is journalism. Newsrooms should evaluate AI tools directly on the tasks they care about most, for instance by designing 'fail tests' that reflect their editorial priorities. Such tests could help catch the kind of distortion of article content that the BBC documented in its own tests of AI assistants; a minimal sketch of what such a test might look like follows below.
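As an illustration, the sketch below shows one way a newsroom might encode a fail test in code: given a source article and an AI-generated summary, it flags summaries that introduce numbers absent from the source or add attribution language the source never used. The function names and regex heuristics are illustrative assumptions, not a description of the BBC's methodology or of any existing newsroom tooling.

```python
import re

def extract_numbers(text: str) -> set[str]:
    """Pull out numeric tokens (e.g. '27', '2025', '3.5') as a crude proxy for factual claims."""
    return set(re.findall(r"\d+(?:\.\d+)?", text))

def summary_fail_test(article: str, summary: str) -> list[str]:
    """Return a list of failure messages; an empty list means the summary passed.

    A real newsroom fail test would encode its own editorial priorities:
    quotes kept verbatim, attributions intact, no figures or names introduced
    that are absent from the source, and so on.
    """
    failures = []

    # Fail if the summary introduces figures that never appear in the article.
    invented = extract_numbers(summary) - extract_numbers(article)
    if invented:
        failures.append(f"summary introduces numbers not in the article: {sorted(invented)}")

    # Fail if the summary adds hedging/attribution language the article never used (toy heuristic).
    if "reportedly" in summary.lower() and "reportedly" not in article.lower():
        failures.append("summary adds attribution language ('reportedly') absent from the article")

    return failures

if __name__ == "__main__":
    article = "The council approved a budget of 12 million euros for 2026."
    summary = "The council reportedly approved a 15 million euro budget."
    for problem in summary_fail_test(article, summary):
        print("FAIL:", problem)
```

Run over a sample of a tool's outputs before deployment, even simple checks like these give editors a pass/fail signal tied to their own standards rather than to a vendor's leaderboard score.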
The Generative AI in the Newsroom project is pushing for the development of benchmarks tailored to journalism. These benchmarks would focus on core use cases such as information extraction, semantic search, summarization, content transformation, background research, and fact-checking.
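To make that concrete, one way such a benchmark could be organized is as a collection of task cases, each tagged with one of the use cases above and paired with its own scoring rule. The structure below is a hypothetical sketch, not the project's actual format; the case shown and its scoring function are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical use-case tags mirroring the categories named above.
USE_CASES = {
    "information_extraction",
    "semantic_search",
    "summarization",
    "content_transformation",
    "background_research",
    "fact_checking",
}

@dataclass
class BenchmarkCase:
    use_case: str                        # one of USE_CASES
    prompt: str                          # the input a newsroom would actually send to a model
    reference: str                       # a gold answer or rubric written by an editor
    score: Callable[[str, str], float]   # maps (model_output, reference) to a value in [0, 1]

def exact_match(output: str, reference: str) -> float:
    """Toy scoring rule; a real benchmark would use rubric- or fact-level scoring."""
    return 1.0 if output.strip().lower() == reference.strip().lower() else 0.0

cases = [
    BenchmarkCase(
        use_case="information_extraction",
        prompt="From the press release below, extract the announced layoff figure: ...",
        reference="450",
        score=exact_match,
    ),
]

def evaluate(model_outputs: list[str]) -> float:
    """Average score across cases, assuming outputs are aligned with `cases`."""
    return sum(c.score(o, c.reference) for c, o in zip(cases, model_outputs)) / len(cases)
```

The point of the structure is that newsrooms could contribute cases drawn from their own archives, so a model's score reflects the work journalists actually do rather than generic test questions.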
Establishing clear standards for the third-party evaluation of AI models could help ensure that journalistic uses of AI are more responsible and trustworthy. A recent Muck Rack study found that 27 percent of all links cited by major models were journalistic, and in time-sensitive queries, nearly half the citations pointed to news publishers. This underscores the importance of ensuring the accuracy and reliability of these models in the newsroom.
In medicine and law, there are ongoing efforts to build domain-specific benchmarks for AI tools. The challenges of generalizing newsroom tasks into a benchmark are significant, however, given the wide variation in editorial contexts.
The prospect of building open datasets raises questions about confidentiality and resources. Nevertheless, the benefits of a more transparent and accountable AI evaluation process could outweigh these challenges.
A recent paper from OpenAI researchers examined why LLMs are prone to 'hallucination,' or fabricating information: it argues that common training and evaluation schemes reward confident guessing over admitting uncertainty. The finding underscores the need for a more rigorous and nuanced approach to evaluating these models.
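The core of that incentive argument can be shown with a small expected-score calculation: under binary grading, guessing always has a non-negative expected payoff while abstaining scores zero, so a model optimized against such a metric learns to guess. The numbers below are illustrative assumptions, not figures from the paper.

```python
# Binary grading: 1 point for a correct answer, 0 for a wrong answer or "I don't know".
def expected_score(p_correct: float, guess: bool) -> float:
    """Expected score when the model either guesses (right with probability p_correct) or abstains."""
    return p_correct if guess else 0.0

for p in (0.1, 0.3, 0.5):
    print(f"p(correct)={p:.1f}  guess={expected_score(p, True):.2f}  abstain={expected_score(p, False):.2f}")

# Even at a 10% chance of being right, guessing beats abstaining, so leaderboard-style
# scoring nudges models toward confident fabrication rather than admitting uncertainty.
```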
In conclusion, the evaluation methods used by major AI companies encourage overconfidence in LLMs. A shift towards task-specific, social-science-grounded evaluations could provide a more accurate and reliable assessment of what these models can do, and in turn help ensure that AI tools are used responsibly in the newsroom and in other high-risk deployment contexts.