Article at-a-glance
– ZeroGPT is more accurate than GPTZero in assessing AI-generated casual blog content: it assigned a 95% average AI probability to AI copy (vs. 84% for GPTZero) and a 0% average AI probability to human blog content (vs. 3% for GPTZero).
– For other types of human-written content (fiction, news reports, political speeches), ZeroGPT is significantly less accurate than GPTZero at correctly identifying human copy: it assigns a 30% AI likelihood to it, on average, versus just over 4% for GPTZero.
– Both testers are unreliable and carry substantial risks.
– ZeroGPT has a high chance of false positives (30% average AI probability for human copy, with a 50% rate of false positives).
– GPTZero has a high chance of false negatives (41% AI probability on average for AI copy, with a 35% rate of false negatives).
AI content checkers are increasingly used by everyone from teachers assessing student work (and the students writing said work) to Google itself.
We selected two of the most popular ones – GPTZero and ZeroGPT – and tested them on both AI and human copy to see just how accurate they are.
The results are occasionally funny but dispiriting overall.
Before we get to the results, though, you’re probably wondering:
How’s that different from other similar tests?
For one thing, we’re directly comparing two of the most popular AI checkers against a large-ish database of content.
Most (if not all) AI checker tests were done using the opposite approach – comparing a large number of testers on a limited number of samples.
That’s not ideal, though – too little data tends to yield unreliable results (and we’ll show later on how testing on a single piece of content with a single checker is virtually useless).
Yes, testing the same copy with multiple checkers will show you which of those checkers performs well on that particular copy, but drawing conclusions beyond that isn’t warranted – there’s simply too much variation from test to test.
We tried to fix that, so we went for a larger database (40 pieces of content in total).
Why GPTZero and ZeroGPT?
Simply put, because they’re the most popular dedicated AI checkers based on traffic and search visibility (Quillbot’s AI detector is up there too, but it’s part of the Quillbot AI-assisted writing package – still worth a look, and we’ll probably cover it in a future piece!).
They’re also among the more accurate options out there: a recent ZDNet test rated ZeroGPT and GPTZero as 80% and 100% accurate, respectively, with the competition doing considerably worse.
Those numbers seem promising, but they don’t paint the full picture – and the ZDNet test itself warns against relying on them too much: results aren’t reliable and will vary from test to test.
But how much do they vary? Being right occasionally or even “in general” doesn’t work – we wanted to know exactly how reliable (or unreliable) these popular testers are, so we ran a test on just the two of them, with enough copy to help us draw meaningful conclusions.
This is part 1 of a two-part series on AI checkers and testers. In part 2, we tested AI humanizers, DeepL, and SurferSEO.
You can also check the in-depth reviews we did for various AI tools like Shortly AI, Wordtune, Jasper AI, and Outwrite AI.
Here’s what we did:
- Generated the most AI-ish casual blog samples out there – asked ChatGPT-4o for 10 popular blog niches, then asked it to come up with short blog samples on topics of its choice; gave it no style prompt, no nothing, fingers crossed that the resulting copy would be as painfully AI as possible (this will be the AI control sample, which we know is 100% AI – see the sketch after this list);
- Selected 30 human-written samples, 10 in each of three categories:
– short stories from the 1840s-1920s,
– news reports from the 1990s, and
– political speeches from 1980-2013
(this will be the human-copy control sample, which we know is 100% human);
- Tested GPTZero and ZeroGPT on both samples and compared the scores.
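If you wanted to script that two-step generation flow rather than typing the prompts by hand, here’s a minimal sketch using OpenAI’s Python SDK – the prompt wording, model name, and parameters are illustrative assumptions, not a record of our exact setup:

```python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable is set

def ask(prompt: str) -> str:
    # Deliberately no system/style prompt – we want maximally "default" AI prose
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Step 1: get 10 popular blog niches (prompt wording is hypothetical)
niches = ask("List 10 popular blog niches, one per line, with no extra text.")

# Step 2: one short blog sample per niche, on a topic of the model's choice
samples = [
    ask(f"Write a short blog post in the {niche!r} niche on a topic of your choice.")
    for niche in niches.splitlines()
    if niche.strip()
]
```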
And here’s what we found out…
1. ZeroGPT is overall more accurate in assessing AI in blog posts than GPTZero…
- Both GPTZero and ZeroGPT correctly identified our AI blog control group as AI, but ZeroGPT was more accurate, giving a 95% average probability, while GPTZero was only 84% certain on average that the in-your-face AI output we produced was, in fact, AI;
- Both testers identified our 100%-human, 0%-AI blog samples as non-AI, with a minor difference once more in favor of ZeroGPT (which was positive there was a 0% chance that our unhinged blog-like babble was AI, versus the slightly higher but still respectable 3% probability given by GPTZero);
- ZeroGPT also had a lower rate of false negatives (AI content incorrectly identified as 30% or less likely to be AI), at just 10% (vs. 35% for GPTZero).
2. …but GPTZero is more accurate overall than ZeroGPT, with fewer false positives that incorrectly label human content as AI
- Once we moved past casual blogging, ZeroGPT showed its limitations, assigning on average a 30% probability that human copy is AI (including some downright funny-if-they-weren’t-sad numbers, like 76% AI for Arthur Conan Doyle’s 1891 short story A Scandal in Bohemia or a whopping 93% for George W. Bush’s 2008 State of the Union Address!), with a 50% rate of false positives (human content assigned a 20% or higher AI probability);
- GPTZero did much better on human non-blog content, with an average AI probability of 4.3% and a false positive rate of 3.3% (only one of the 30 samples tested was rated above 20% likely to be AI – a 1987 speech by Jimmy Carter).
The final rates of false negatives and false positives are as follows (a short sketch of how these rates are computed follows the list):
- False positives:
  - GPTZero: 3.3%
  - ZeroGPT: 50%
- False negatives:
  - GPTZero: 35%
  - ZeroGPT: 10%
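If you want to sanity-check the arithmetic or rerun it on your own samples, here’s a minimal Python sketch of how these rates fall out of per-sample scores, using the thresholds defined above – the score lists are made-up placeholders, not our actual data:

```python
# Hypothetical per-checker AI-probability scores (0-100); placeholders only.
ai_scores = [95, 88, 97, 25, 91, 100, 15, 89, 94, 96]  # known-AI samples
human_scores = [0, 3, 76, 0, 5, 93, 2, 0, 41, 1]       # known-human samples

# Thresholds as defined in this article:
#   false negative = a known-AI sample scored 30% or less likely to be AI
#   false positive = a known-human sample scored 20% or more likely to be AI
false_negative_rate = sum(s <= 30 for s in ai_scores) / len(ai_scores)
false_positive_rate = sum(s >= 20 for s in human_scores) / len(human_scores)

print(f"False negatives: {false_negative_rate:.1%}")  # 20.0% for this fake data
print(f"False positives: {false_positive_rate:.1%}")  # 30.0% for this fake data
```

Swap in the real per-sample scores from either checker and the same two lines give you the rates in the list above.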
What Does This Mean?
For one, ZeroGPT performed significantly better than GPTZero when assessing casual blogs (which matters, because it did worse everywhere else): it identified AI copy as AI with higher certainty, it was less likely to be fooled by AI humanizers (more on those in part 2), and it maintained a flawless 0% AI score when checking legit human copy (GPTZero gave the same copy a slightly more cautious 3% probability of being AI). For a free tool, it did pretty well with casual blog content.
But when we looked at other types of human writing, ZeroGPT showed its limitations (and remained true to its infamous reputation for making up AI scores for undeniably non-AI copy). It rated clearly human copy (19th and early 20th century short stories, news reports from the 90s, and political speeches from the late 20th and early 21st century) as highly likely to be AI, giving these samples a 30% AI probability on average, with a 50% rate of false positives.
Let’s hope it’s wrong – otherwise our history comes into question, and maybe aliens did bring AI to Earth hundreds of years ago!
So ultimately, neither of these testers is very accurate; and while GPTZero has lower rates of false results overall, it’s still ridiculously risky, with a 35% rate of false negatives – and a non-negligible chance of throwing up the odd false positive, too.
Essentially, these results mean that if you run a single text through either tester, you’re about as likely to be fed a false result as a correct one; ZeroGPT has similar chances of rating human copy as 0% AI as it does 60% AI, and GPTZero might very well rate every third AI article as “probably human”.
If you had to use one, though, we’d probably recommend GPTZero, simply because it’s less likely to punish human writers; there’s really nothing worse for a writer (or editor!) than having your copy labeled “likely AI” when you know very well you didn’t touch an AI tool.
Now there may be some merit in looking at why human copy was labeled as likely AI by ZeroGPT – or why AI copy was labeled as likely human by GPTZero – but that’s for another article.
Until then, AmpiFire can help you drive more visibility to your business with quality content development and distribution – get in touch today to see what we can do for you!
Authors
CEO and Co-Founder at AmpiFire.