
How AI detectors actually work - perplexity, burstiness, and the math

May 20, 2026 · 8 min read

Modern AI detectors look at two signals: how predictable each word is, and how that predictability changes sentence to sentence. Here is what that means.

An AI detector is a classifier that analyzes text to estimate the probability that it was written by a large language model rather than a human. As of 2026, most commercial detectors rely on two core statistical signals: perplexity and burstiness. Perplexity measures how predictable each word is given its context; burstiness measures how much that predictability fluctuates from sentence to sentence. Understanding these metrics shows what detectors can and cannot do, and why AI-generated text often has a recognizable statistical fingerprint.

What is perplexity and how do AI detectors use it?

Perplexity is a measure of how surprised a language model is by a given word, averaged across all words in a text. When a model encounters a word it considers highly likely given the context, perplexity decreases; when it encounters a surprising word, perplexity increases. AI detectors calculate the perplexity of a submitted text using a reference model (often GPT-2, GPT-3.5, or a similar architecture) and compare it to known baselines for human writing.
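One common way to write this definition in symbols is shown here. The notation is an assumption for illustration, not taken from any specific detector: a text has N tokens w_1 through w_N, and p is the reference model's conditional probability for each token.

```latex
\mathrm{PPL}(w_1,\dots,w_N)
  = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p\,(w_i \mid w_1,\dots,w_{i-1})\right)
```

As a worked example, if the model assigns probability 0.25 to each of four tokens, the mean log-probability is log 0.25, so the perplexity is exp(-log 0.25) = 4. The model is as uncertain as if it were choosing uniformly among four options at every step.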

The key insight: LLMs generate text by selecting from a relatively narrow probability distribution. At each token, the model assigns high probability to a small set of coherent continuations. Humans, by contrast, draw on idiosyncratic vocabulary, slang, emotional expressions, and contextual detours that a reference model rates as less likely. Very low perplexity across an entire human-written text usually means the writer was being unusually repetitive or formulaic, which is uncommon.

What is burstiness and why does it matter for detection?

Burstiness measures the variability of perplexity scores across a text. A bursty text has wild swings in how predictable its words are; a non-bursty text has consistent, stable perplexity. Human writing is typically high-burstiness because a writer may shift from casual language (low perplexity, high word frequency) to technical jargon (high perplexity, specialized terms) within the same passage.

AI models tend to produce low-burstiness text because they sample tokens from distributions that are relatively stable across the generation process. Even if a model is told to "write like a CEO" or "explain in simple terms," it applies that instruction uniformly, dampening the natural peaks and valleys of human code-switching. Research by OpenAI and independent researchers has shown that combining perplexity and burstiness scores improves detection accuracy over perplexity alone.
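A minimal sketch of the burstiness idea, assuming per-sentence perplexities have already been computed. The function name `burstiness` and the sample numbers are invented for illustration and do not come from any real detector.

```python
import statistics

def burstiness(sentence_perplexities):
    """Return (std_dev, coefficient_of_variation) of per-sentence perplexity."""
    mean = statistics.fmean(sentence_perplexities)
    std_dev = statistics.pstdev(sentence_perplexities)  # population std dev
    cv = std_dev / mean if mean else 0.0
    return std_dev, cv

# Hypothetical per-sentence perplexities, invented for this sketch.
steady = [40.0, 42.0, 38.0, 41.0, 39.0]      # predictability barely moves
varied = [25.0, 140.0, 60.0, 210.0, 35.0]    # predictability swings widely

steady_std, steady_cv = burstiness(steady)
varied_std, varied_cv = burstiness(varied)
print(f"steady: std={steady_std:.2f}, cv={steady_cv:.2f}")
print(f"varied: std={varied_std:.2f}, cv={varied_cv:.2f}")
```

The coefficient of variation divides the spread by the mean, which makes texts with very different average perplexity easier to compare.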

How do detectors actually calculate these metrics?

Most detectors follow a three-step pipeline. First, they tokenize the input text using a consistent tokenizer (e.g., GPT-2's byte-pair encoding). Second, they compute a log-probability for each token using a reference language model, then derive perplexity as the exponential of the negative mean log-probability. Third, they segment the text into sentences or fixed-size windows and calculate the standard deviation or coefficient of variation of the per-segment perplexity scores, which approximates burstiness.
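The three steps above can be sketched end to end with a deliberately tiny stand-in model. Everything here is an assumption for illustration: a regex word splitter replaces a real subword tokenizer, and an add-one-smoothed unigram model replaces a neural reference model, so it ignores context entirely. The helper names (`tokenize`, `build_unigram_model`, `analyze`) are hypothetical.

```python
import math
import re
import statistics
from collections import Counter

def tokenize(text):
    # Step 1: stand-in for a real subword tokenizer such as byte-pair encoding.
    return re.findall(r"[a-z']+", text.lower())

def build_unigram_model(corpus):
    """Toy reference model: add-one-smoothed unigram log-probabilities."""
    counts = Counter(tokenize(corpus))
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot shared by unseen tokens

    def log_prob(token):
        return math.log((counts[token] + 1) / (total + vocab_size))

    return log_prob

def perplexity(tokens, log_prob):
    # Step 2: exponential of the NEGATIVE mean log-probability.
    mean_lp = sum(log_prob(t) for t in tokens) / len(tokens)
    return math.exp(-mean_lp)

def analyze(text, log_prob):
    # Step 3: split into sentences and measure the spread of their perplexities.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = [s for s in parts if tokenize(s)]
    per_sentence = [perplexity(tokenize(s), log_prob) for s in sentences]
    overall = perplexity(tokenize(text), log_prob)
    spread = statistics.pstdev(per_sentence) if len(per_sentence) > 1 else 0.0
    return overall, per_sentence, spread

reference = "the cat sat on the mat. the dog sat on the rug."
log_prob = build_unigram_model(reference)
overall, per_sentence, spread = analyze(
    "The cat sat on the mat. Quantum flux perturbs everything!", log_prob
)
print(overall, per_sentence, spread)
```

The second sentence uses only words the toy model has never seen, so its perplexity is much higher than the first sentence's and the spread is well above zero. A real detector would swap in a neural model's conditional token probabilities at step 2.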

Some detectors (like the UmanWrite AI detector) also factor in linguistic features such as rare word frequency, named entity density, or semantic coherence. Others use neural classifiers trained on labeled datasets of human and AI text, which learn perplexity and burstiness implicitly rather than computing them explicitly. The end output is typically a probability score between 0 and 1, not a binary verdict.
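One plausible way to turn several features into a single score between 0 and 1 is a logistic function over a weighted sum. This is only a sketch: the feature names, weights, and bias below are hypothetical values chosen for illustration, not parameters from UmanWrite or any other detector.

```python
import math

# Hypothetical weights and bias, invented for this sketch.
# Negative weights mean "higher feature value lowers the AI probability".
WEIGHTS = {"perplexity": -0.03, "burstiness": -0.05, "rare_word_ratio": -4.0}
BIAS = 4.0

def ai_probability(features):
    """Map a feature dict to a probability-like score in (0, 1)."""
    z = BIAS + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative feature values only.
flat_text = {"perplexity": 30.0, "burstiness": 10.0, "rare_word_ratio": 0.08}
varied_text = {"perplexity": 150.0, "burstiness": 60.0, "rare_word_ratio": 0.25}

flat_score = ai_probability(flat_text)
varied_score = ai_probability(varied_text)
print(f"flat text:   {flat_score:.3f}")
print(f"varied text: {varied_score:.3f}")
```

In practice the weights would be fitted on labeled human and AI text rather than set by hand.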

Detection signal	What it measures	Human text typical range	AI text typical range
Perplexity	Average word predictability (lower = more predictable)	80-200 (varies widely)	20-60 (more consistent)
Burstiness (std dev of perplexity)	Variability in predictability across sentences	30-80 (high variation)	5-20 (low variation)
Rare word frequency	Proportion of words outside top 5000 word list	15-35%	5-12%
Semantic coherence	How consistently on-topic the text is	Moderate (with tangents)	High (narrow focus)
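The rare word frequency row can be illustrated with a short sketch. The `COMMON_WORDS` set below is a tiny hypothetical stand-in for a real top-5,000 frequency list, and `rare_word_ratio` is an invented helper name.

```python
import re

# Tiny stand-in for a real top-5,000 word list (assumption for this sketch).
COMMON_WORDS = {
    "the", "a", "of", "to", "and", "in", "is", "it", "on", "was",
    "cat", "sat", "mat",
}

def rare_word_ratio(text, common_words=COMMON_WORDS):
    """Fraction of words in `text` that fall outside the common-word list."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    rare = sum(1 for w in words if w not in common_words)
    return rare / len(words)

print(rare_word_ratio("The cat sat on the mat"))
print(rare_word_ratio("The ocelot sat on the ottoman"))
```

With a real frequency list the same function would yield percentages comparable to the ranges in the table.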

Why do humanized AI texts often evade detection?

When AI text is rewritten or edited to introduce human-like variation, it artificially raises perplexity and burstiness. A humanizer tool achieves this by replacing common words with synonyms, breaking up long sentences, adding filler phrases, and injecting colloquialisms or redundancy. The result is text that mimics the statistical profile of human writing without changing the factual content.

This is less a matter of cheating the detector than of weakening the signal it relies on. A detector trained only on raw LLM output has never seen AI text that has been manually rewritten five times. The detector's training set is finite; humanization techniques exploit that gap.

Adding casual connectives ('you know', 'I mean', 'frankly') raises sentence-level burstiness
Replacing high-frequency words with rarer synonyms ('vehicle' for 'car', 'terminate' for 'end') increases perplexity variance
Splitting compound sentences and adding short declarative statements creates burstiness spikes
Including personal anecdotes or speculative asides lowers semantic coherence slightly, mimicking human tangents
Repeating key concepts in different words increases vocabulary spread without lowering overall text quality

Why do detectors still produce false positives and false negatives?

No detector in 2026 achieves 100% accuracy because perplexity and burstiness are statistical, not deterministic, signals. A human who writes very formulaic text (e.g., a sales email template used repeatedly, or technical documentation) will have low perplexity and low burstiness. A skilled AI operator who edits the output rigorously will have high perplexity and high burstiness. The distributions overlap.

Additionally, detectors are trained on specific language models and domains. A detector trained on GPT-3 output may perform poorly on Claude or Gemini, which have different sampling behaviors. And a detector trained on English news articles may misclassify technical writing, creative fiction, or non-native English speakers.

False positive (human flagged as AI): occurs when a human writer is low-burstiness (habitual, formulaic, or neurodivergent writing styles) or writes in a domain the detector was not trained on
False negative (AI passed as human): occurs when the AI text has been humanized, the model used has naturally higher perplexity (like claude-3-sonnet), or the detector's decision threshold is set so high that recall suffers
Accuracy drift: as LLMs improve and generate more varied outputs, detectors trained on older models become less reliable
Domain mismatch: a detector trained on social media posts will have different accuracy on academic papers or source code
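The trade-off between these two error types can be sketched by sweeping a decision threshold over detector scores. The scores below are hypothetical, invented so that the human and AI distributions overlap; `error_rates` is an illustrative helper, not part of any real tool.

```python
def error_rates(human_scores, ai_scores, threshold):
    """Return (false_positive_rate, false_negative_rate).

    Scores are detector outputs in [0, 1]; a score >= threshold is flagged as AI.
    """
    false_positives = sum(s >= threshold for s in human_scores)
    false_negatives = sum(s < threshold for s in ai_scores)
    return false_positives / len(human_scores), false_negatives / len(ai_scores)

# Hypothetical scores with overlapping distributions.
human_scores = [0.05, 0.20, 0.35, 0.55, 0.70]
ai_scores = [0.45, 0.60, 0.80, 0.90, 0.95]

for threshold in (0.3, 0.5, 0.7):
    fpr, fnr = error_rates(human_scores, ai_scores, threshold)
    print(f"threshold={threshold}: false positives={fpr:.0%}, false negatives={fnr:.0%}")
```

Raising the threshold flags fewer humans but lets more AI text through, and lowering it does the reverse. No threshold removes both errors while the score distributions overlap.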

What are the limitations of perplexity and burstiness as detection signals?

Perplexity and burstiness are model-relative. They depend entirely on which reference model you use to compute log-probabilities. A text may have low perplexity relative to GPT-2 but high perplexity relative to a smaller model. This means detectors are not measuring something objective about the text itself; they are measuring its fit to a particular model's beliefs.
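A small sketch makes the model-relative point concrete. It scores the same sentence against two toy reference models built from different corpora. Both are add-one-smoothed unigram models, which is an assumption made to keep the example self-contained; the corpora and the `unigram_perplexity` helper are invented for illustration.

```python
import math
import re
from collections import Counter

def words(text):
    return re.findall(r"[a-z']+", text.lower())

def unigram_perplexity(text, reference_corpus):
    """Perplexity of `text` under a toy unigram model built from `reference_corpus`."""
    counts = Counter(words(reference_corpus))
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot shared by unseen words
    tokens = words(text)
    mean_lp = sum(
        math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens
    ) / len(tokens)
    return math.exp(-mean_lp)

# Two hypothetical reference corpora from different domains.
legal_corpus = "the court held that the contract was void. the court denied the motion."
cooking_corpus = "stir the sauce and simmer the onions. season the sauce with salt."

text = "The court denied the contract."
ppl_legal = unigram_perplexity(text, legal_corpus)
ppl_cooking = unigram_perplexity(text, cooking_corpus)
print(f"legal reference:   {ppl_legal:.2f}")
print(f"cooking reference: {ppl_cooking:.2f}")
```

The text is identical in both calls; only the reference model changes, and the perplexity changes with it.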

Additionally, these metrics ignore semantic truth and usefulness. A detector cannot tell if a sentence is factually accurate, logically sound, or relevant. It can only estimate how likely the words are given the context. This means a detector could flag a perfectly coherent human argument while passing a plausible-sounding but incorrect AI hallucination.

How can you lower your detection risk without sacrificing quality?

The most practical approach is to humanize AI-generated content after drafting. Use a tool that reintroduces variation while preserving accuracy. Replace overused transitions ('however', 'therefore') with conversational ones. Vary sentence length deliberately. Add specific examples, names, or numbers that you verify independently. These moves raise perplexity and burstiness while deepening the actual content.

A second approach is to use AI as an outline or research aid, then draft the final text yourself. This hybrid workflow naturally produces higher burstiness because you are applying your own voice, pacing, and vocabulary choices. The AI handles the heavy lifting (brainstorm, structure, source synthesis); you handle the human layer (voice, judgment, verification).

Finally, always fact-check and test your output. Run it through UmanWrite's AI detector to see your score, then adjust if needed. If you score in the gray zone (40-60% AI probability), humanize further or add more original sections. The goal is not to fool detection at any cost; it is to create text that is genuinely useful and defensible.

AI detection is a game of signal and noise, not a binary gate. Perplexity and burstiness are powerful heuristics, but they are not foolproof. Understanding how they work gives you the power to write with AI responsibly, and to choose the right humanizing approach for your use case. Whether you are a student, marketer, or content creator, the key is transparency with your audience and alignment with your goals.

Frequently asked questions

What is perplexity in AI detection?

Perplexity measures how predictable each word is given its context, calculated using a reference language model. Lower perplexity means the words are more predictable and typical of that model; higher perplexity means the words are more surprising or varied. AI detectors use perplexity as a signal because LLMs tend to generate lower-perplexity text than humans do.

How does burstiness differ from perplexity?

Perplexity is the average predictability of words across the entire text. Burstiness is how much that predictability varies from sentence to sentence. Human text is typically high-burstiness (predictability spikes and drops) while AI text is low-burstiness (predictability stays consistent). Detectors use both metrics together for better accuracy.

Can you game an AI detector by using a humanizer?

Yes, humanizing tools can lower detection scores by increasing perplexity and burstiness. However, this is not 'gaming' the detector illegitimately if the text remains accurate and truthful. The best practice is to use humanization as part of a responsible editing process, combined with fact-checking and verification, rather than as a way to hide AI assistance.

Are AI detectors accurate in 2026?

No detector is 100% accurate. Typical commercial detectors achieve 85-95% accuracy on raw AI text, but accuracy drops significantly on humanized text, domain-specific writing, and non-English content. False positives (human flagged as AI) and false negatives (AI passed as human) both occur at rates that matter for high-stakes use.

Why do detectors fail on some AI writing and not others?

Detectors are trained on specific models and domains. Text generated by Claude or Gemini may have different perplexity profiles than GPT-3, so a detector tuned for one model may perform poorly on another. Additionally, skilled editing, domain-specific jargon, and deliberate humanization can all reduce detection signals.

Is it unethical to use AI writing tools if I humanize the output?

It depends on context and transparency. Using AI for drafting, research, or brainstorming, then rewriting and fact-checking yourself, is standard practice and not deceptive. Submitting AI text as human-written to a teacher or publication without disclosure is dishonest. The ethical rule: disclose AI involvement when required, and always verify accuracy regardless.

What is the best way to avoid AI detection while keeping my writing good?

Focus on quality and authenticity rather than evasion. Draft with AI, then edit aggressively: vary sentence length, add specific examples, inject your voice, and verify facts. Use a humanizer tool to formalize the editing process. Run your final text through an AI detector to check your score. The result is text that reads naturally and would pass scrutiny because it reflects genuine thought.

What metrics do AI detectors use besides perplexity and burstiness?

Some detectors also measure rare word frequency (how many words fall outside the top 5,000), semantic coherence (how consistently on-topic the text is), named entity density, and linguistic patterns like pronoun frequency or clause structure. Neural-network-based detectors learn these features implicitly rather than computing them explicitly, which can improve accuracy but reduces interpretability.

Sources

OpenAI
