How this works

This page explains the design choices behind the indicator-first detector.

We trained this model to mark the text itself, instead of giving a single classifier score, because we wanted to see the actual signals that make text look AI-written or human-written. For example, phrases like "moreover" and "in conclusion", or very polished transition language, can lean AI, while first-person slips, rough phrasing, or awkward edits can lean human. The model has to find those tells on its own, and we keep the result on the text so the evidence is visible.
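To make "marking the text itself" concrete, here is a minimal sketch of how marked spans could be represented and rendered. The `Tell` record, its fields, and the bracket format are hypothetical names chosen for illustration, not the detector's actual output schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tell:
    """One marked span (hypothetical schema, for illustration only)."""
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str   # "ai" or "human"
    reason: str  # short explanation of why the span leans that way


def render_marks(text: str, tells: list[Tell]) -> str:
    """Wrap each marked span in [label: ...] so the evidence stays on the text."""
    out = []
    cursor = 0
    for t in sorted(tells, key=lambda t: t.start):
        # Skip spans that overlap an earlier span, are empty, or fall outside the text.
        if t.start < cursor or t.end > len(text) or t.start >= t.end:
            continue
        out.append(text[cursor:t.start])
        out.append(f"[{t.label}: {text[t.start:t.end]}]")
        cursor = t.end
    out.append(text[cursor:])
    return "".join(out)


text = "Moreover, i kinda lost track here."
tells = [
    Tell(0, 8, "ai", "polished transition word"),
    Tell(10, 17, "human", "lowercase first person, rough phrasing"),
]
marked = render_marks(text, tells)
print(marked)  # [ai: Moreover], [human: i kinda] lost track here.
```

Unlike a single score, this output shows which words carried the signal and in which direction.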

We used GRPO because we wanted the model to learn from groups of rollouts on the same document, not from a single fixed answer. Each rollout is judged relative to the other attempts in its group, which is useful when one rollout finds a real AI tell and another finds something weak or irrelevant. We also enforce diversity by using different prompt templates, such as AI-directed and human-directed variants, so the model sees multiple angles and does not collapse into one style of explanation. The tells come from the model, not from a fixed detector rule that we hand it.
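The group comparison can be sketched as follows. In the usual GRPO formulation, each rollout's reward is normalized against the mean and standard deviation of its own group. This page does not state the exact normalization used here, so the function below, its name, and the `eps` constant should be read as assumptions.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (reward - group mean) / (group std + eps).

    `rewards` holds one scalar reward per rollout on the same document.
    `eps` (an assumed constant) avoids division by zero when all rewards tie.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts on one document; the rewards are made-up numbers.
rewards = [0.9, 0.2, 0.5, 0.4]
advantages = group_advantages(rewards)

# The rollout that found the strongest tell gets a positive advantage,
# the weakest gets a negative one, and a tied group gives no signal.
tied = group_advantages([0.5, 0.5, 0.5])
```

Because the baseline is the group itself, a rollout is rewarded for being better than its siblings on the same document, not for matching a fixed answer.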

Separating tell generation from scoring also matters because it reduces reward hacking. If the same model both invents the reason and assigns the score, it can cheat by writing flattering explanations for weak spans. With the two steps separated, the model proposes spans and reasons, and a separate scorer judges how convincing they are. That makes the system closer to an evidence finder than a standard classifier, and we think that is more useful when the goal is to explain what looks AI-like or human-like in the text.
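A minimal sketch of that separation, under stated assumptions: `Proposal`, `rollout_reward`, and `toy_scorer` are hypothetical names, averaging the per-proposal scores is an assumed aggregation, and the toy scorer is a keyword stand-in for what would in practice be a separate scoring model. The structural point is that the proposer never supplies a score.

```python
from typing import Callable, NamedTuple


class Proposal(NamedTuple):
    """What the proposing model emits: a span, a direction, and a reason."""
    span: str
    label: str   # "ai" or "human"
    reason: str


# The scorer is a separate component; the proposer has no score field to fill in.
Scorer = Callable[[str, Proposal], float]


def rollout_reward(document: str, proposals: list[Proposal], scorer: Scorer) -> float:
    """Average the scorer's judgment over a rollout's proposals (assumed aggregation)."""
    if not proposals:
        return 0.0
    return sum(scorer(document, p) for p in proposals) / len(proposals)


AI_CUES = {"moreover", "in conclusion"}  # toy cue list, for illustration only


def toy_scorer(document: str, p: Proposal) -> float:
    """Stand-in scorer: checks the span itself and ignores how the reason is worded."""
    if p.span not in document:
        return 0.0
    return 1.0 if p.label == "ai" and p.span.lower() in AI_CUES else 0.0


doc = "Moreover, the results were good. In conclusion, it works."
strong = [Proposal("Moreover", "ai", "stock transition word")]
weak = [Proposal("results", "ai", "an overwhelming, unmistakable sign of AI authorship")]

strong_reward = rollout_reward(doc, strong, toy_scorer)  # 1.0
weak_reward = rollout_reward(doc, weak, toy_scorer)      # 0.0
```

In this toy setup the emphatic reason attached to the weak span earns nothing, because the reward comes only from the scorer's own judgment of the evidence.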