When it comes to evaluating AI-generated text, perplexity has been a widely used metric. For those unfamiliar, perplexity measures how well a language model predicts a given sequence of words. Lower perplexity means the model finds the text more predictable, i.e. more likely. You can think of low perplexity as the text ‘flowing well’ from the model’s point of view.
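As a concrete illustration, here is a minimal sketch of how perplexity is typically computed in practice. It assumes the Hugging Face transformers library with GPT-2 as the scoring model, but any causal language model would work the same way:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

def perplexity(text: str, model_name: str = "gpt2") -> float:
    """Perplexity of `text` under a causal language model.

    Lower values mean the model finds the text more predictable.
    """
    tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name)
    model.eval()

    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing the input ids as labels makes the model return the
        # mean cross-entropy loss over the sequence.
        out = model(enc.input_ids, labels=enc.input_ids)
    # Perplexity is the exponential of the mean cross-entropy.
    return float(torch.exp(out.loss))

print(perplexity("The cat sat on the mat."))
```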
The theory behind perplexity-based detection is that if the text has unusually low perplexity, it must have been generated by a language model. By definition, a language model tends to generate low-perplexity text (as evaluated by itself).
Seems pretty clever, except that, if the language model is any good, it will assign low perplexity to well-flowing human writing too. So if you’re a good writer, these detection methods may flag your work, essentially saying ‘this flows too well for a human to have written it’.
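For concreteness, the detection rule behind many of these tools amounts to little more than a threshold check like the hypothetical one below. The threshold value is arbitrary and the `perplexity` helper is the sketch from above; this is not any particular vendor's method:

```python
def looks_ai_generated(text: str, threshold: float = 30.0) -> bool:
    # Hypothetical detector: flag anything the scoring model finds
    # "too predictable". A polished human essay can easily fall below
    # the same threshold, which is exactly the false-positive problem.
    return perplexity(text) < threshold
```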
As teachers begin to clamp down on AI-assisted cheating, this method will also flag the best work from the best students as potentially AI-generated, punishing the students who have learned to write well.
There have been discussions about the bigger players (like OpenAI) embedding watermarks in the text their models generate (perhaps by nudging word choices, or by other signals that wouldn’t be easy for a reader to detect), but it’s unlikely that open-source models would follow suit.
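To make the watermarking idea concrete, here is a toy sketch of the published ‘green list’ approach: a secret, pseudorandom subset of the vocabulary gets a small boost before sampling, so the generated text statistically over-represents those tokens. The function names and parameters are illustrative assumptions, not how OpenAI or anyone else actually does it:

```python
import hashlib
import random

def green_list(prev_token_id: int, vocab_size: int, fraction: float = 0.5) -> set[int]:
    # Seed a PRNG with the previous token so the "green" set is
    # reproducible by anyone who knows the secret scheme.
    seed = int(hashlib.sha256(str(prev_token_id).encode()).hexdigest(), 16)
    rng = random.Random(seed)
    return set(rng.sample(range(vocab_size), int(vocab_size * fraction)))

def bias_logits(logits: list[float], prev_token_id: int, delta: float = 2.0) -> list[float]:
    # Nudge the model toward "green" tokens before sampling. The output
    # then over-represents green tokens, which a detector who knows the
    # scheme can count, while a human reader notices nothing unusual.
    green = green_list(prev_token_id, len(logits))
    return [x + delta if i in green else x for i, x in enumerate(logits)]
```

The catch, as noted above, is that this only works if the model provider opts in; anyone running an unwatermarked open-source model sidesteps it entirely.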
I think the solution is actually to re-think how we evaluate writing mastery in the first place. It will be hard, but it might be the only sensible path forward.
(image by Stable Diffusion)