OpenAI Says GPT-4 Is Great For Content Moderation, But Seems… A Bit Too Trusting

from the i-can't-block-that,-dave dept

In our Moderator Mayhem mobile browser game that we released back in May, there are a couple of rounds where some AI is introduced to the process, as an attempt to “help” the moderators. Of course, in the game, the AI starts making the kinds of mistakes that AI makes. That said, there’s obviously a role for AI and other automated systems within content moderation and trust & safety, but it has to be done thoughtfully, rather than thinking that “AI will solve it.”

Jumping straight into the deep end, though, is OpenAI, which just announced how they’re using GPT-4 for content moderation, and making a bunch of (perhaps somewhat questionable) claims about how useful it is:

We use GPT-4 for content policy development and content moderation decisions, enabling more consistent labeling, a faster feedback loop for policy refinement, and less involvement from human moderators.

The explanation of how it works is… basically exactly what you’d expect. They write policy guidelines, feed them to GPT-4, and then run it against some content to see how it performs compared to human moderators applying the same policies. It’s also set up so that a human reviewer can “ask” GPT-4 why it classified some content the way it did, though it’s not at all clear that GPT-4 will give an accurate answer to that question.
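OpenAI’s post doesn’t include code, but the loop it describes is simple enough to sketch. Roughly, it looks something like this (a minimal sketch assuming the standard OpenAI Python client; the policy text and the K0/K3/K4 labels below are hypothetical stand-ins, not OpenAI’s actual taxonomy):

```python
# A rough sketch of the workflow OpenAI describes: paste the written policy
# into the prompt, ask GPT-4 for a label, and compare those labels against
# human reviewers' labels on the same content. The policy text and label
# names here are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

POLICY = """
K0: Allowed content.
K3: Borderline content that discusses, but does not encourage, wrongdoing.
K4: Content that provides instructions or encouragement for wrongdoing.
"""

def classify(content: str) -> str:
    """Ask GPT-4 to label one piece of content under the policy above."""
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # reduces, but does not eliminate, run-to-run variation
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a content moderator. Apply the following policy "
                    "and respond with a single label (K0, K3, or K4).\n" + POLICY
                ),
            },
            {"role": "user", "content": content},
        ],
    )
    return resp.choices[0].message.content.strip()

# In the loop OpenAI describes, these labels get compared against human
# moderators' labels for the same items, and disagreements feed back into
# edits to the policy text itself.
```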

The point OpenAI makes about this is that it allows for faster iteration, because you can roll out policy changes instantly, rather than having to get a large team of human moderators up to speed. And, well, sure, but that doesn’t necessarily deal with most of the other challenges of content moderation at scale.

The biggest question for me, knowing how GPT-4 works, is how consistent the outputs are. The whole thing about LLM tools like this is that giving them the same inputs may produce wildly different outputs. That’s part of the fun of these models: rather than looking up the “correct” answer, they generate one on the fly by taking a probabilistic approach to figuring out what to say. Given that one of the big complaints about content moderation is “unfair treatment,” this seems like a pretty big question.
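That consistency question is at least measurable: run the same item through the same prompt a bunch of times and see how often the label flips. A quick, hypothetical sketch, reusing the classify() function from the example above:

```python
from collections import Counter

def label_stability(content: str, runs: int = 10) -> Counter:
    """Tally the labels GPT-4 returns when asked about the same item repeatedly."""
    return Counter(classify(content) for _ in range(runs))

# Something like Counter({'K0': 9, 'K3': 1}) would mean the model flipped its
# decision on one run in ten -- exactly the kind of inconsistency that fuels
# "unfair treatment" complaints, and it can happen even with temperature at 0.
```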

And not just in the most direct sense: one of the reasons there are so many complaints about “unfair treatment” of “similar” content is that a trust & safety team often needs to understand the deeper nuance and context behind claims. Stated more explicitly: malicious and nefarious users often try to hide or justify their problematic behavior by presenting it in a form that mimics content from good actors… and then use that to play victim, pretending there’s been unfair treatment.

This is why some trust & safety experts talk about the “I wasn’t born yesterday…” test that a trust & safety team needs to apply in recognizing those deliberately trying to game the system. Can AI handle that? As of now there’s little to no indication that it can.

And this gets even more complex when put into the context of the EU’s already problematic DSA, which requires platforms to explain their moderation choices in most cases. Now, I think that requirement is troubling for a variety of reasons, and is likely going to lead to even greater abuse and harassment, but it will soon be the law in the EU, and then we get to see how it plays out.

But, how does that work here? Yes, GPT-4 can “answer” the question, but again, there’s no way to know if that answer is accurate, or if it’s just saying what it thinks the person wants to hear.

I do think AI tools like GPT-4 will be a helpful part of handling trust & safety issues. There is a lot they can do to assist a trust & safety team. But we should be realistic about their limits, and about where they can best be put to use. And OpenAI’s description on this page sounds naively optimistic about some things, and just ill-informed about others.

Over at Platformer, Casey Newton got a lot more positive responses about this offering from a bunch of experts (all experts I trust), noting that, at the very least, it might be useful in raising the baseline for trust & safety teams: let the AI handle the basic stuff and pass the thornier problems along to humans. For example, Dave Willner, who until recently ran trust & safety for OpenAI, noted that it’s very good in certain circumstances:

“Is it more accurate than me? Probably not,” Willner said. “Is it more accurate than the median person actually moderating? It’s competitive for at least some categories. And, again, there’s a lot to learn here. So if it’s this good when we don’t really know how to use it yet, it’s reasonable to believe it will get there, probably quite soon.”

Similarly, Alex Stamos from the Stanford Internet Observatory said that in testing various AI systems, his students found GPT-4 to be really strong:

Alex Stamos, director of the Stanford Internet Observatory, told me that students in his trust and safety engineering course this spring had tested GPT-4-based moderation tools against their own models, Google/Jigsaw’s Perspective model, and others.

“GPT-4 was often the winner, with only a little bit of prompt engineering necessary to get to good results,” said Stamos, who added that overall he found that GPT-4 works “shockingly well for content moderation.”

One challenge his students found was that GPT-4 is simply more chatty than they are used to in building tools like this; instead of returning a simple number reflecting how likely a piece of content is to violate a policy, it responded with paragraphs of text.

Still, Stamos said, “my students found it to be completely usable for their projects.”
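For what it’s worth, that chattiness is mostly a plumbing problem. One common workaround (again, just a hypothetical sketch, reusing the client and POLICY from the earlier example) is to ask the model for structured output and parse only the field your pipeline actually needs:

```python
import json

def violation_score(content: str) -> float:
    """Ask GPT-4 for a JSON verdict and return just the numeric score."""
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a content moderator. Apply this policy:\n" + POLICY
                    + '\nRespond ONLY with JSON of the form '
                    '{"label": "K0", "violation_likelihood": 0.0} and nothing else.'
                ),
            },
            {"role": "user", "content": content},
        ],
    )
    # The model can still wander off-format, so a real pipeline would
    # validate this output before trusting it.
    return json.loads(resp.choices[0].message.content)["violation_likelihood"]
```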

Those results are good to hear, but they also… worry me. Note that both Willner and Stamos highlighted that it was good, with caveats. But in operationalizing tools like this, I wonder how many companies are going to pay that much attention to those caveats, as opposed to just going all in.

Again, I think it’s a useful tool. I keep talking about how important it is, in all sorts of areas, to look at AI as a tool that raises the baseline for all kinds of jobs, and how that could even revitalize the middle class. So, in general, this is a good step forward, but there’s a lot about it that makes me wonder whether those implementing it will really understand its limitations.

Companies: openai


Comments on “OpenAI Says GPT-4 Is Great For Content Moderation, But Seems… A Bit Too Trusting”

This comment has been deemed insightful by the community.
TKnarr (profile) says:

It's not just the content

One of the facts of life about content moderation is that it’s not just the content that moderators have to look at, it’s the history of the account and the person behind it. The extremes are clear-cut yes/no decisions. In between though the decision often depends on whether the poster is a normally good poster who follows the rules but slipped up this once and went too close to the line, an agitator who constantly pushes the line trying to go as far as he can without getting dinged, or a bad actor who consistently breaks the rules and happened not to behave quite so badly this time. The whole thing is hard enough for humans who are committed to acting responsibly when moderating. Hand it off to tier-1 contract monkeys following a script for decisions and it becomes a mess. Hand it off to AI that doesn’t have any concept of “lying liar who lies” and it’ll pass “cesspit” falling fast and accelerating.
