How to Handle Ambiguous AI Detection Results in Student Work
An AI detection tool returns a likelihood score of 61% on a student's coursework. It is not low enough to dismiss, not high enough to feel certain about, and the student is sitting in your next lesson. This is the moment that ambiguous AI detection results create — and it is a moment that more UK secondary teachers are facing every term as AI tools like ChatGPT and Claude become more widely available to students.
The honest answer is that a mid-range score does not tell you whether the student used AI. It tells you that the text shares some statistical characteristics with AI-generated writing. What you do with that information requires professional judgment, not just a number. This guide walks through what the likelihood score actually means, why ambiguity is built into AI detection, and how to build a process that is fair to students and defensible to colleagues and parents.
What the Likelihood Score Is — and What It Is Not
GradeOrbit's AI detection tool returns a likelihood score between 0% and 100%. A very low score suggests the text strongly resembles human-written work. A very high score suggests it strongly resembles AI-generated output. Everything in between — and there is a lot of everything in between — reflects genuine uncertainty.
The score is the output of a probabilistic model that analyses linguistic patterns: sentence structure, vocabulary distribution, syntactic consistency, and the statistical regularities that tend to distinguish AI-generated text from human prose. It is not a record of what software was used to produce the work. There is no digital signature attached to AI-generated text, and detection tools cannot read a student's browser history. They can only infer, from the characteristics of the writing itself, how likely it is that a human produced it unaided.
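To make "statistical regularities" concrete, here is a deliberately simplified sketch in Python. It measures a single pattern: how uniform the sentence lengths in a piece of writing are. Human prose tends to vary its sentence lengths more than unedited AI output does, and low variation is the kind of signal a probabilistic model might weigh. To be clear, this is a toy illustration of one feature, not GradeOrbit's detection model, and the function name is invented for this example.

```python
import statistics

def sentence_length_variation(text: str) -> float:
    """Coefficient of variation of sentence lengths, in words.

    A toy stand-in for one signal a detector might use: very uniform
    sentence lengths (a low value) are a statistical regularity often
    associated with unedited AI output. Not GradeOrbit's model.
    """
    # Naive sentence split; real systems use proper tokenisers.
    cleaned = text.replace("!", ".").replace("?", ".")
    sentences = [s.strip() for s in cleaned.split(".") if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0  # Not enough sentences to measure variation.
    mean = statistics.mean(lengths)
    return statistics.stdev(lengths) / mean if mean else 0.0

essay = (
    "The Treaty of Versailles punished Germany severely. "
    "Reparations crippled the economy. "
    "Many historians argue that this resentment helped extremists gain ground."
)
print(f"Sentence-length variation: {sentence_length_variation(essay):.2f}")
```

A real detector combines many such features, and no single one proves anything on its own, which is exactly why mid-range scores exist.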
GradeOrbit offers two detection modes. The standard check uses 1 credit and returns a fast likelihood score, suitable for routine screening. The in-depth analysis uses 3 credits and applies a more capable model, producing a fuller breakdown of the linguistic signals behind the score along with a reasoning paragraph. For genuinely ambiguous cases (the ones this guide is about), the 3-credit analysis will usually give you more to work with.
Why Mid-Range Scores Are the Hardest Cases
If AI detection scores were reliably binary — clearly human at one end, clearly AI at the other — the job would be straightforward. In practice, the cases that matter most to teachers fall in the uncertain middle, and there are several reasons why.
Students who use AI as a starting point and then edit heavily will often produce work that scores in the mid-range. The more they rewrite, restructure, and add their own voice, the more the text moves away from the statistical patterns that detection models flag — but traces can remain. A student who used ChatGPT to generate a plan, pulled in some phrasing from that plan, and then wrote the rest themselves might produce work that scores anywhere from 30% to 70%, depending on how much of the original AI text survived.
Conversely, students who write with unusual fluency, consistency, and formal register — particularly highly proficient writers, students who have been extensively coached, or students who write in English as an additional language and have a very structured style — can produce work that scores surprisingly highly without any AI involvement at all. The detection model cannot distinguish between "this is what ChatGPT sounds like" and "this is what a very competent formal writer sounds like," because at a statistical level, they can look similar.
Subject register compounds this further. A student writing a GCSE History essay with a well-practised analytical structure, or an A-Level Psychology student applying the PEEL framework rigorously to every paragraph, may produce text that is more consistent and formally patterned than typical human prose, and that pattern is precisely what detection models are sensitive to.
Steps to Take Before Drawing Any Conclusions
When a score lands in ambiguous territory, a structured approach will serve you better than a gut reaction. Here is a practical sequence that takes the detection score seriously without treating it as a verdict.
Compare Against Previous Work
Your most powerful tool is not the detection score — it is your existing knowledge of the student. Before anything else, look at previous examples of their writing: exercise book entries, earlier homework tasks, timed in-class responses. If the submitted work is dramatically different in quality, sophistication, or register from what they have produced before, that is a meaningful data point. If it is entirely consistent with their previous work — if this student has always written this way — that matters too, and the detection score becomes much less concerning.
Read the Text Carefully
AI-generated text tends to have characteristic patterns that a careful reader can often detect independently: uniform sentence length, a lack of genuine personal voice, heavy use of hedging phrases like "it could be argued" or "it is important to note," and a kind of competent generality that hits the required points without any authentic specificity. If you read the piece and it feels like the student — if there are idiosyncratic phrasings, genuine personal responses, or the kind of small imperfections that characterise student thinking — that is evidence worth weighing.
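If you want a feel for how mechanical some of these tells are, the sketch below simply counts the hedging phrases mentioned above. It is a toy, not a detector: a real model weighs far subtler signals, and a human reader weighs voice and specificity in ways no phrase list captures. The phrase list here is illustrative only.

```python
# Toy illustration: count the stock hedging phrases mentioned above.
# A short list like this proves nothing on its own; it just shows how
# surface-level some of the recognisable patterns are. A real list
# would be much longer, and a real model goes well beyond phrase counts.
HEDGING_PHRASES = [
    "it could be argued",
    "it is important to note",
]

def count_hedges(text: str) -> dict[str, int]:
    lowered = text.lower()
    return {phrase: lowered.count(phrase) for phrase in HEDGING_PHRASES}

sample = (
    "It is important to note that the evidence is mixed. "
    "It could be argued that both factors played a role."
)
print(count_hedges(sample))
# {'it could be argued': 1, 'it is important to note': 1}
```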
Consider the Submission Context
When did the work arrive? How does this student typically submit? A piece that arrives at 11:50pm on the deadline night from a student who usually struggles and typically submits rough work is a different situation from the same detection score on work submitted by a reliable, motivated student who always takes care over their writing. Context does not override the score, but it informs how seriously to weigh it.
Having the Conversation with the Student
If you remain genuinely uncertain after working through those steps, the most productive next move is a short, exploratory conversation with the student. The key is to approach it without accusation. Your goal is to understand the work better, not to confront the student.
Ask them to talk you through their argument. Ask where a particular piece of evidence came from, or how they decided to structure the piece the way they did. A student who wrote the work will be able to do this, perhaps imperfectly, perhaps with some gaps, but with the rough coherence of someone who genuinely thought through the ideas. A student who submitted AI-generated content with minimal engagement will often struggle to explain their own essay, or will give answers that do not match the sophistication of what they submitted.
This conversation is also protective. If the student genuinely did write the work and the score is a false positive, this is the moment you discover that and can move on without having caused unnecessary distress. That outcome is just as important as identifying genuine AI use.
When to Escalate and When to Let It Go
Not every ambiguous score warrants formal escalation. If a score is mid-range, the work is consistent with the student's previous output, the writing passes a careful read, and the student can discuss their ideas coherently, then let it go. Document your assessment, note the score for context, and move on. Treating every ambiguous result as a suspected infringement is unsustainable, and it risks damaging relationships with students who have done nothing wrong.
Formal escalation makes sense when multiple indicators converge: a high score, a significant departure from the student's previous work, unusual submission timing, and a conversation in which the student cannot account for their own ideas. At that point, follow your school's academic integrity policy closely. Involve a senior colleague before taking any action, document every step of your reasoning, and treat the detection score as one piece of evidence among several — not as proof on its own.
For schools that want a consistent approach to handling detection results across departments, our guide on how schools can implement AI detection consistently covers the policy and process considerations in detail.
How GradeOrbit Supports Responsible Detection
GradeOrbit's detection tool is designed to support teacher judgment rather than replace it. The output for every detection run includes the likelihood score, a breakdown of the specific linguistic signals that contributed to it, and — in the 3-credit in-depth mode — a reasoning paragraph that explains what the model found and why. This gives you more than a single number to work with: it gives you the evidence behind the number, which you can evaluate in the context of everything else you know about the student and the work.
Student work is never stored after processing. Content is sent for analysis and then discarded — it is not retained on GradeOrbit's servers and is not used to train any AI model. We recommend redacting any identifying information before submitting work, so the detection model never sees the student's name or other personal details. Our full guide on how to redact student information before AI detection explains how to do this quickly using GradeOrbit's built-in redaction tool.
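For readers who like to see the principle, here is a minimal sketch of what a redaction pass does before text leaves your machine. It is not GradeOrbit's built-in redaction tool, just an illustration of the idea: known names and obvious identifiers are masked before the content is sent anywhere. The names and addresses in the example are invented.

```python
import re

def redact(text: str, known_names: list[str]) -> str:
    """Mask known student names and obvious email addresses.

    A minimal sketch of the redaction principle, not GradeOrbit's
    built-in tool, which handles this step for you.
    """
    for name in known_names:
        text = re.sub(re.escape(name), "[REDACTED]", text, flags=re.IGNORECASE)
    # Naive email pattern; good enough for an illustration.
    text = re.sub(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", "[EMAIL]", text)
    return text

print(redact("Essay by Priya Shah (priya.shah@school.example)", ["Priya Shah"]))
# Essay by [REDACTED] ([EMAIL])
```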
Try GradeOrbit's AI Detection Feature
Ambiguous AI detection results are not a failure of the technology — they are an honest reflection of a genuinely difficult problem. No tool can tell you with certainty whether a student used AI. What a good tool can do is give you reliable, transparent information that you can fold into a fair, evidence-based process.
GradeOrbit's detection feature is built directly into your dashboard and works with pasted text, uploaded documents, or scanned images of handwritten work. The 3-credit in-depth analysis is there for exactly the cases where you need more than a headline score. Try GradeOrbit and see how it supports your existing approach to academic integrity.