Large language models can act as predictive models. Here is an example of misinformation detection, and an introduction to savings curves.
Not all business problems are best solved with generative AI – for some problems, prediction is the solution.
Take misinformation detection as an example: Let's say you run a social media platform and find that one-third of user posts on a particular high-risk channel convey misinformation, resulting in your business being heavily criticized in the press.
Like fraud prevention, credit risk management, and targeted marketing, this is exactly the kind of problem that requires prediction: on the one hand, you can't trust a machine to automatically filter every post, and on the other hand, you can't manually review every post – it's too costly.
Predictive AI flags (predicts) cases of interest. Misinformation detection flags posts that are most likely to convey misinformation, which then need to be audited and blocked by humans.
One chart summarizes the value of this approach.
Misinformation detection savings curve: savings (vertical axis) versus the percentage of posts manually audited (horizontal axis). Chart: Eric Siegel
It plots money saved (vertical axis) versus percentage of posts manually audited (horizontal axis). The leftmost position corresponds to auditing zero posts, meaning no misinformation is blocked. The rightmost position corresponds to auditing all posts, meaning all misinformation is blocked, but at the expense of a large amount of human effort.
The shape of this curve guides the balance between auditing too little and too much: in this example, auditing the 51% of posts most likely to convey misinformation maximizes savings relative to not auditing any posts and not blocking any misinformation.
These savings are estimated in part based on two types of costs:
1) The cost of manually auditing the posts (set at $4 in this example).
2) The cost of not detecting misinformation (set at $10 in this example).
While the first can be objectively determined based on labor costs, the second is subjective, so there may not be a clear way to set it. Many use cases for predictive AI face a similar dilemma: How do you quantify the cost of a medical condition going undetected? Or the cost of an important email message being mistakenly routed to the spam folder? In all such projects, it is important to give stakeholders the power to change the cost setting so they can see how it affects the shape of this curve. This is the topic of a follow-up article coming soon.
The estimated peak savings of $442,000 assumes 200,000 posts (in high-risk channels). In other words, with 200,000 such posts per week and the right decision threshold, your company could save $442,000 per week.
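To make the arithmetic concrete, here is a minimal Python sketch of how such a savings curve could be computed. The $4 audit cost, the $10 cost of an undetected misinformation post, the 200,000-post volume, and the roughly one-third misinformation rate follow the example above, but the model scores are synthetic stand-ins, so the peak will not match the $442,000 figure, which depends on the real model's ranking quality.

```python
import numpy as np

# Example parameters from the article; the scores below are synthetic stand-ins.
AUDIT_COST = 4.0          # cost of manually auditing one post
MISS_COST = 10.0          # cost of one undetected misinformation post
N_POSTS = 200_000         # weekly posts in the high-risk channel
MISINFO_RATE = 1 / 3      # roughly one-third of posts convey misinformation

rng = np.random.default_rng(0)
is_misinfo = rng.random(N_POSTS) < MISINFO_RATE
# Synthetic model scores: misinformation tends to score higher, but imperfectly.
scores = rng.normal(loc=is_misinfo.astype(float), scale=0.7)

# Rank posts from most to least suspicious (the chart's left-to-right ordering).
order = np.argsort(-scores)
caught = np.cumsum(is_misinfo[order])   # misinfo caught if the top k posts are audited
audited = np.arange(1, N_POSTS + 1)     # number of posts audited at each threshold

# Savings relative to auditing nothing: each caught post avoids MISS_COST,
# but each audited post incurs AUDIT_COST.
savings = caught * MISS_COST - audited * AUDIT_COST

best_k = int(np.argmax(savings)) + 1
print(f"Audit top {best_k / N_POSTS:.0%} of posts -> estimated savings ${savings[best_k - 1]:,.0f}")
```

Varying MISS_COST here is exactly the stakeholder-facing adjustment described above: changing it reshapes the curve and shifts the threshold where savings peak.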
Most predictive AI projects see a similar effect, with a "Goldilocks zone" somewhere in between: value is maximized by examining neither too many cases nor too few. That's the value of a predictive score, after all: it prioritizes cases and helps you decide where to draw the line (where to set your decision threshold) for which cases to "process" (audit, contact, approve, etc.). The horizontal axis reflects this ordering, and the dotted vertical lines represent example decision threshold settings.
However, most predictive AI projects don't go so far as to plot such a curve, so they can't really “see” this effect. Tracking business metrics like savings is important, but it's still not common practice. Instead, projects typically only track technical metrics that don't provide clear insight into the potential business value.
Prediction with large language models
Predictive models come in many shapes and sizes: decision trees, logistic regression, ensemble models, etc. However, large language models may be appropriate for language-intensive tasks such as misinformation detection. Such models are typically used for generative AI (generating draft content), but can also be used for prediction.
To test this, we leveraged a project at Stanford University, which tested various LLMs on a range of benchmarks, including one that measured how often the models could determine whether a given statement was true or false. The test cases were designed to assess reading comprehension and did not represent the type of publicly available misinformation commonly found on social media. As such, we used this testbed only to demonstrate how LLMs can help detect misinformation. This discussion should not be considered a rigorous research project on the effectiveness of this approach.
For each case, OpenAI's GPT-3 (175 billion parameters) was asked multiple times whether the sentence was true or false, using several simple mechanical variations on the wording. Each LLM response counts as a "vote," and these votes are converted into a predicted score for each test case. These scores determine the chart's left-to-right ordering: cases on the left are predicted more likely to be false, and those on the right more likely to be true. Just over a third (37%) of the test cases were false sentences, and the rest were true.
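As a rough illustration of the voting step, here is a sketch of how repeated true/false prompts could be turned into a single predicted score per statement. The `ask_llm` function is a hypothetical stand-in for whatever completion API you use (the article's tests used GPT-3), and the prompt variations are illustrative, not the ones used in the project.

```python
from typing import Callable

# Hypothetical stand-in for a call to the LLM: given a prompt, return the raw text answer.
AskLLM = Callable[[str], str]

# Simple mechanical rewordings of the same question (illustrative only).
PROMPT_TEMPLATES = [
    'Is the following statement true or false? "{s}"',
    'True or false: "{s}"',
    '"{s}" Answer with one word, true or false:',
    'Consider this claim: "{s}" Is it true or false?',
]

def misinformation_score(statement: str, ask_llm: AskLLM) -> float:
    """Return the fraction of prompt variants the model answered 'false'.

    Higher scores mean the statement is predicted more likely to be false,
    matching the left-to-right ordering described above.
    """
    votes_false = 0
    for template in PROMPT_TEMPLATES:
        answer = ask_llm(template.format(s=statement)).strip().lower()
        if answer.startswith("false"):
            votes_false += 1
    return votes_false / len(PROMPT_TEMPLATES)

# Usage: score a batch of statements, then sort so the most likely
# misinformation appears first (the left side of the chart).
# ranked = sorted(statements, key=lambda s: -misinformation_score(s, ask_llm))
```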
Regardless of what kind of model you use to make a prediction, its value can usually be represented by a graph like the one shown above. The model's usefulness is determined by how well it orders cases from left to right. This prioritization is what allows you to choose where to draw the line (your decision threshold).
Beyond the goal of maximizing one business metric, such as savings or profits, establishing a decision threshold requires consideration of other trade-offs. We'll continue with this example in the next article so that you can consider your options more holistically.