
Study: AI detection software varies in effectiveness


People who frequently use AI for writing tasks were quite effective at identifying AI-generated text, even without any specialized training


The emergence of generative AI technologies such as ChatGPT has challenged educators to find effective ways to identify whether the work their students submit is original or AI-generated.

As educators look for help in making this determination, many are turning to automated AI detection technologies that claim to distinguish between human and AI-generated text. But not all such technologies work the same way or have the same success rate, a recent study found.

Led by Jenna Russell, a Ph.D. student in Computer Science at the University of Maryland, the study compared how well humans could detect AI-generated text against the performance of commercial and open-source AI detectors. The study found that, among the automated solutions, the AI detection program Pangram significantly outperformed the competition.

Pangram was “definitely the best detector we were able to test,” Russell says.

How the study worked

The study involved five phases of increasing difficulty. For each phase, the researchers chose 30 unique nonfiction articles written by humans and, for each one, created an AI prompt to generate an article of similar length on the same topic, for a total of 60 articles per phase.

In phase one, the researchers used GPT-4o to create the AI-generated articles. In phase two, they used Claude, an AI assistant built by the American company Anthropic. In phase three, they used a paraphrased version of content created by GPT-4o, similar to how many students might try to fool their teacher by paraphrasing the work of an AI generator. In phase four, they used o1-Pro, a more advanced version of ChatGPT. In phase five, they used a “humanized” version of content created by o1-Pro, in which words and phrases that sounded AI-generated were changed into more human-sounding language.
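To make that design concrete, here is a rough sketch of how the resulting dataset might be organized. The structure, labels, and function names are illustrative assumptions based on the description above, not the study’s released code.

```python
# Illustrative sketch of the study design described above (not the authors' code).
# Each phase pairs 30 human-written nonfiction articles with 30 AI-generated
# counterparts on the same topics: 60 articles per phase, 300 overall.

PHASES = {
    1: "GPT-4o",
    2: "Claude",
    3: "GPT-4o, paraphrased",
    4: "o1-Pro",
    5: "o1-Pro, humanized",
}

def build_phase(phase_id: int, human_articles: list[str], generate) -> list[dict]:
    """Pair each human article with an AI article of similar length on the same topic.

    `generate` is a hypothetical helper that prompts the phase's model.
    """
    assert len(human_articles) == 30
    dataset = []
    for article in human_articles:
        dataset.append({"phase": phase_id, "label": "human", "text": article})
        dataset.append({"phase": phase_id, "label": "ai",
                        "text": generate(article, model=PHASES[phase_id])})
    return dataset  # 60 labeled articles for this phase
```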

The researchers recruited five people who were expert users of generative AI and also skilled at analyzing language, such as teachers, writers, and editors. They compared the performance of these five human experts against five automated AI detection solutions in distinguishing between human and AI-generated work: Pangram and GPTZero, both commercial software programs; Binoculars and Fast-DetectGPT, both open-source detection tools; and RADAR, a detection framework created by Chinese researchers.

Overall, Pangram’s technology was the only one to outperform all five individual experts in identifying AI-generated articles, with a 99.3-percent success rate. Pangram was almost perfect in the first four phases of the experiment, misidentifying only one of the human articles as AI, and it was 96.7-percent effective in identifying humanized o1-Pro content. (In this hardest phase, GPTZero was successful less than half the time, at 46.7 percent, and the open-source options struggled badly.)

“Pangram is near perfect on the first four experiments and falters just slightly on humanized o1-Pro articles, while GPTZero struggles significantly on o1-Pro with and without humanization,” the report says. “The open-source detectors degrade in the presence of paraphrasing and underperform both [commercial] detectors by large margins.”

Pangram’s approach

Why was Pangram’s technology found to be more effective? The answer lies in how its software is trained, the company says.

Many automated AI detection programs use factors such as “perplexity” and “burstiness” to distinguish between human and AI-generated content. Perplexity measures how unexpected each word is, while burstiness measures how much that unexpectedness changes over the course of a document: text that mixes predictable stretches with occasional surprising words or phrases is high in burstiness.
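As a rough illustration of those two signals (not any detector’s actual implementation), here is a minimal sketch that treats burstiness as the variance of per-word surprisal, one common way to operationalize it; the word probabilities are assumed to come from some language model.

```python
import math

def surprisal(probability: float) -> float:
    """Surprisal (in bits) of a word: high when the word is unexpected."""
    return -math.log2(probability)

def perplexity_and_burstiness(word_probs: list[float]) -> tuple[float, float]:
    """word_probs: probability a language model assigned to each word in the document.

    Perplexity summarizes how unexpected the words are on average; burstiness
    here is taken as the variance of surprisal across the document, i.e. how
    much the unexpectedness fluctuates (an illustrative assumption).
    """
    surprisals = [surprisal(p) for p in word_probs]
    mean_surprisal = sum(surprisals) / len(surprisals)
    perplexity = 2 ** mean_surprisal
    burstiness = sum((s - mean_surprisal) ** 2 for s in surprisals) / len(surprisals)
    return perplexity, burstiness

# A detector built on these signals might flag low-perplexity, low-burstiness
# documents as likely AI-generated; that is exactly the heuristic that can
# misfire on emergent writers, as discussed below.
```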

The idea behind using these factors is that writing from humans tends to be more creative, with some unexpected flourishes, while machine-generated text is much more formulaic. But there are some shortcomings inherent in this approach, says Pangram co-founder Bradley Emi.

Chief among them is that emergent writers who are still learning the language and who might lack confidence in their writing (which describes many students, and English learners in particular) generally produce lower-perplexity writing, which means their work could easily be misidentified as AI-generated content.

“During the language learning process, the student’s vocabulary is significantly more limited, and the student is also not able to form complex sentence structures that would be out of the ordinary … for a language model,” Emi writes. “We argue that learning to write in a high perplexity, bursty way that is still linguistically correct is an advanced language skill that [only] comes from experience with the language.”

Pangram works more effectively because it uses an approach called “synthetic mirrors,” in which it trains its software to detect AI by pairing every human writing sample with an AI-generated version of the same article. Whenever the model makes a mistake–either failing to identify the AI version or falsely characterizing the human version as AI–the company generates another synthetic mirror from this document and adds it to the training set. In this way, the software “learns” what AI-generated content looks like, much like a human would–by learning from its mistakes.
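A minimal sketch of that training loop, as described above, might look like the following. The function names and model interface are hypothetical placeholders, not Pangram’s actual code.

```python
# Hypothetical sketch of the "synthetic mirrors" training loop described above.
# `generate_mirror` and the model interface are illustrative assumptions.

def train_with_synthetic_mirrors(human_docs, generate_mirror, model, rounds=3):
    """Pair every human document with an AI-generated "mirror" on the same topic,
    then retrain repeatedly, adding fresh mirrors for documents the model gets wrong."""
    training_set = []
    for doc in human_docs:
        training_set.append((doc, "human"))
        training_set.append((generate_mirror(doc), "ai"))

    for _ in range(rounds):
        model.fit(training_set)
        for text, label in list(training_set):
            if model.predict(text) != label:
                # Mistake: either an AI mirror slipped past the model, or a
                # human document was falsely flagged as AI. Generate another
                # synthetic mirror of the misclassified document and add it,
                # so the model learns from the error.
                training_set.append((generate_mirror(text), "ai"))
    return model
```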

“With this training method, we were able to reduce our false positives by a factor of 100 and ship a model that we’re proud of,” the company notes in a technical report about its methodology.

Humans fare well

Perhaps surprisingly, Russell and her colleagues found that people who frequently use AI for writing tasks were quite effective at identifying AI-generated text, even without any specialized training or feedback.

Individually, the five experts ranged in accuracy from 59.3 percent to 97.3 percent. Collectively, however, they were nearly perfect: the majority vote among these experts misclassified only one of the 300 articles.

“The majority vote of our five expert annotators substantially outperforms almost every commercial and open-source detector we tested,” the researchers wrote, “with only the commercial Pangram model … matching their near-perfect detection accuracy.”
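The ensemble effect behind that result is easy to see in code. Here is a tiny illustration of majority voting across five judgments; the labels are made up, not the study’s data.

```python
from collections import Counter

def majority_vote(judgments: list[str]) -> str:
    """Return the label ("human" or "ai") chosen by most of the annotators."""
    return Counter(judgments).most_common(1)[0][0]

# Individually imperfect judges can be nearly perfect together: even if one or
# two of the five experts misread an article, the majority still gets it right.
print(majority_vote(["ai", "ai", "human", "ai", "human"]))  # -> "ai"
```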

The “mix of background knowledge on grammar rules and writing conventions allows people to spot a lot of inconsistencies in human writing,” Russell explains. “With greater use of gen AI, people learn the patterns,” such as the kinds of words and phrasings that tend to crop up in AI-generated versus human writing. She adds: “We found that our five experts all used a different set of individual clues. We hypothesize that if the experts were taught all the tools used by each expert, they would be even better at detecting [AI-generated] text.”

The study’s findings have important implications for educators, Russell believes.

“We know there are clues in AI-generated texts that humans can learn how to spot,” she notes. “This helps teachers (and everyone) approach text with a toolkit to see who is [doing the] writing.”

Often, teachers run a student’s work through an AI detector and take the results at face value, without applying any human oversight. Doing so could result in unwarranted suspicion or accusations. Russell concludes that teachers can learn to provide this oversight themselves: “We believe these skills can be leveraged to help a teacher feel more comfortable using AI detectors as a tool to aid their own suspicions, rather than blindly following a detection tool.”

Dennis Pierce
