A technical research note on evaluating AI-generated learning content beyond fluency and factuality. We outline a practical scoring approach for instructional quality using relevance, groundedness, clarity, actionability, sharpness, and contrast.
Most AI-generated educational content fails in a subtle but important way: it is often correct, coherent, and still not very effective at teaching. The problem is rarely grammar, and it is not always factual accuracy. More often, the weakness lies in instructional usefulness. A learner can read a lesson, understand every sentence, and still leave without the ability to choose, explain, or apply what they have just read. This research note outlines a practical approach to scoring AI-generated learning content as teaching rather than as text. It argues that generic quality metrics such as relevance, clarity, and completeness are necessary but insufficient, and proposes additional dimensions such as sharpness and contrast to better capture instructional quality.
The evaluation of AI-generated content often relies on broad quality indicators: fluency, relevance, readability, and factual consistency. These measures are useful, but they do not fully address the demands of instructional content. Educational material has a different success condition. It must not only be correct, but also help a learner understand, distinguish, and use knowledge in practice.
This distinction becomes especially important in objective-aligned learning systems, where the goal is not merely to produce plausible content, but to support performance. In that setting, a well-written explanation that does not reduce confusion or improve decision-making is still a weak output.
This note presents an applied scoring perspective shaped by that problem. The central claim is that instructional quality requires a rubric that goes beyond generic text quality and includes dimensions that reflect decisiveness, distinction, and transfer.
Generic scoring systems tend to reward content that is fluent, relevant to the prompt, readable, and factually consistent.
These are useful signals. However, they do not reliably answer the question that matters most in learning design: does this help the learner perform?
A lesson can be highly readable and still fail instructionally. Common failure cases include definitions that are correct in isolation but never compared, openings too generic to orient the learner, and boundary statements that are absent or merely implied.
In other words, generic quality measures tend to over-reward descriptive adequacy. Instructional quality requires something stronger: practical discriminability and transfer support.
Instructional content operates under a stricter standard than general informational text. A successful lesson should help the learner understand the material, distinguish between similar options, explain the topic clearly, and choose and apply appropriately in practice.
This means that content quality in education must be evaluated in relation to learner outcomes, not only textual properties. A technically correct lesson may still be weak if it leaves the learner unable to explain the topic clearly or choose appropriately between alternatives.
For this reason, educational scoring must be sensitive not only to semantic correctness, but also to instructional shape.
A practical scoring model for instructional content should include several dimensions. Some are familiar from broader evaluation frameworks; others arise from repeated failure patterns in teaching-oriented generation.
A useful rubric may include relevance, groundedness, clarity, actionability, and completeness, together with sharpness and contrast.
The first five dimensions are common in many evaluation systems. The last two become necessary when the goal is not just intelligibility, but effective teaching.
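Treated computationally, such a rubric can be sketched as a weighted aggregate over dimension scores. The dimension names follow this note; the dataclass, the 0.0–1.0 scale, and the weights (which deliberately emphasize sharpness and contrast) are illustrative assumptions, not a fixed implementation.

```python
from dataclasses import dataclass, fields

@dataclass
class LessonScores:
    """Per-dimension scores for one lesson, each in [0.0, 1.0]."""
    relevance: float
    groundedness: float
    clarity: float
    actionability: float
    completeness: float
    sharpness: float
    contrast: float

# Weighting sharpness and contrast above the generic dimensions reflects
# the note's claim that they are necessary, not stylistic extras.
# These particular weights are an assumption for this sketch.
WEIGHTS = {
    "relevance": 1.0, "groundedness": 1.0, "clarity": 1.0,
    "actionability": 1.0, "completeness": 1.0,
    "sharpness": 1.5, "contrast": 1.5,
}

def aggregate(scores: LessonScores) -> float:
    """Weighted mean of all rubric dimensions."""
    total = sum(getattr(scores, f.name) * WEIGHTS[f.name] for f in fields(scores))
    return total / sum(WEIGHTS.values())
```

Under this weighting, a lesson that is generically excellent but scores zero on sharpness and contrast loses more than a third of its aggregate, which is the intended pressure.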
Sharpness is concerned with whether the content identifies what matters most.
Many AI-generated lessons are accurate but soft. They introduce a topic in broad, competent language without clarifying the real decision, tension, or distinction the learner needs to grasp. A lesson may describe several frameworks as “helping organizations manage security and compliance” without telling the learner why one would be chosen over another.
This is not simply a style issue. It is an instructional weakness. Learners often need orientation before detail. If the opening framing is too generic, the rest of the lesson may remain technically correct but pedagogically vague.
A sharp lesson does at least three things: it names the real decision, tension, or distinction at stake; it tells the learner why one option would be chosen over another; and it provides that orientation before descending into detail.
Sharpness, then, is a measure of instructional decisiveness.
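As a rough illustration of how decisiveness might be surfaced automatically, one can check whether a lesson's opening names an explicit choice or contrast. The cue list below is an assumption of this sketch, not a validated detector, and would at best complement model-based judgment of sharpness.

```python
import re

# Hypothetical cue phrases signalling that an opening frames a decision
# or contrast rather than a generic description. Illustrative only.
DECISION_CUES = re.compile(
    r"\b(choose|choosing|decide|deciding|versus|vs\.?|rather than|"
    r"instead of|when to use|trade-?off)\b",
    re.IGNORECASE,
)

def opening_is_sharp(opening: str) -> bool:
    """Return True if the lesson opening names an explicit choice or contrast."""
    return bool(DECISION_CUES.search(opening))
```

A framing like "These frameworks help organizations manage security and compliance" carries no such cue, which matches the soft-opening failure described above.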
Contrast is concerned with whether similar items are made usefully different.
This is especially important in lessons involving closely related frameworks or standards, overlapping tools or methods, and near-synonymous concepts that learners routinely confuse.
A common failure pattern in AI-generated instruction is the “isolated definition” problem: the content defines each item correctly, but not comparatively. The learner is given several valid descriptions but no strong basis for choosing between them.
Strong contrast usually requires same-axis comparison. For example, two compliance frameworks should be compared along the same axes, such as scope, intended audience, and the kind of evidence each produces, rather than each being described only on its own terms.
Without this structure, the content may remain descriptive but not discriminative. That distinction matters because instructional success often depends on discrimination. The learner does not only need to know what each option is; they need to know how not to confuse them.
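One minimal way to check for same-axis comparison is structural: if each option's description is represented as a mapping from axis name to statement, every option should be described along an identical set of axes. The representation and the axis names in the test data are illustrative assumptions.

```python
def same_axis_complete(options: dict[str, dict[str, str]]) -> bool:
    """True when all options are described along an identical set of axes.

    `options` maps each option name to a dict of axis -> description.
    An empty input is treated as failing the check.
    """
    axes = [set(description) for description in options.values()]
    return bool(axes) and all(axis_set == axes[0] for axis_set in axes)
```

A comparison that describes one option by scope and audience but the other only by history would fail this check, which is exactly the "isolated definition" pattern.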
One of the most persistent failure modes in AI-generated learning content can be described as accurate but non-decisive.
In this pattern, a lesson defines each item accurately, covers the expected material, and reads fluently and coherently.
At first glance, the content appears strong. Yet the instructional weakness becomes clear on review. The hero section frames the topic broadly rather than naming the learner’s real choice. The comparison is fragmented across multiple partial structures. One option is treated as primary while another remains secondary background. Boundaries such as “this is not a certification” or “do not use this when formal evidence is required” are absent or merely implied.
A generic evaluator may still score this highly. A learner, however, often finishes the lesson with the central confusion unresolved.
This is precisely the kind of mismatch that motivated the inclusion of sharpness and contrast in the scoring model.
Model-based evaluation is flexible, but it tends to be forgiving. A language model can recognize that content is coherent, relevant, and plausible, then still overestimate its teaching value.
To reduce this inflation, it is useful to combine semantic evaluation with structural signals. These signals do not replace human judgment, but they help capture recurring weaknesses that purely semantic scoring often misses.
Examples of useful structural indicators include whether the opening names an explicit decision, whether comparisons are consolidated into a single same-axis structure rather than fragmented across partial ones, and whether boundary statements such as “this is not a certification” appear explicitly.
The purpose of these signals is not to mechanize pedagogy, but to resist a common failure mode: over-scoring content that is well-formed but instructionally weak.
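As one concrete instance, the boundary-statement indicator can be approximated with simple phrase patterns. The patterns below are assumptions of this sketch, seeded from examples like “this is not a certification” and “do not use this when formal evidence is required”; a production system would want something less brittle.

```python
import re

# Illustrative patterns for explicit boundary statements. Hypothetical,
# not an exhaustive or official list.
BOUNDARY_PATTERNS = re.compile(
    r"\b(this is not|is not a|do not use (this|it) when|does not replace)\b",
    re.IGNORECASE,
)

def has_boundary_statement(lesson_text: str) -> bool:
    """True when the lesson explicitly states what the topic is not, or is not for."""
    return bool(BOUNDARY_PATTERNS.search(lesson_text))
```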
The most useful structural adjustments are not broad or heavy-handed. They are narrow correctives applied to repeated weak patterns.
For example, a narrow corrective might penalize a lesson that fragments its comparison across several partial structures, frames the topic generically rather than naming the learner’s real choice, or omits explicit boundary statements.
These are not cosmetic issues. They directly affect whether the learner can use the content as intended.
In this sense, structural signals function as corrective pressure against semantic generosity.
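That corrective pressure can be sketched as small, capped penalties subtracted from a model-assigned semantic score. The specific checks and penalty sizes below are assumptions for illustration; the point is the shape of the adjustment, which stays narrow and never lets structure dominate the semantic judgment.

```python
def adjusted_score(semantic_score: float,
                   has_same_axis_comparison: bool,
                   has_boundary_statement: bool,
                   opening_names_decision: bool) -> float:
    """Apply narrow structural penalties to a semantic score in [0.0, 1.0]."""
    penalty = 0.0
    if not has_same_axis_comparison:
        penalty += 0.10   # comparison fragmented or missing
    if not has_boundary_statement:
        penalty += 0.05   # no explicit "this is not ..." boundary
    if not opening_names_decision:
        penalty += 0.05   # generic framing instead of a named choice
    # Cap the total adjustment and floor the result at zero, so structural
    # signals correct the semantic score rather than replace it.
    return max(0.0, semantic_score - min(penalty, 0.20))
```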
No automated scoring system fully replaces human instructional judgment.
Human reviewers can detect qualities that are difficult to formalize completely, such as whether a lesson’s framing actually resolves the learner’s central confusion and whether a technically complete comparison would genuinely support a real decision.
For that reason, scoring systems benefit from calibration against reviewed examples over time. The value of human review is not only in catching errors, but in exposing systematic optimism in the automated evaluator.
In practice, automated scoring should be treated as a disciplined first pass rather than a final verdict.
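The calibration described above can be made measurable by pairing automated scores with human review scores on the same lessons and tracking the mean gap. A persistently positive gap is exactly the "systematic optimism" the note warns about. The function and the example values are illustrative.

```python
def mean_bias(automated: list[float], human: list[float]) -> float:
    """Average (automated - human) over paired lesson scores.

    A result greater than zero indicates the automated evaluator is
    systematically more generous than human reviewers.
    """
    if len(automated) != len(human) or not automated:
        raise ValueError("need equal-length, non-empty score lists")
    return sum(a - h for a, h in zip(automated, human)) / len(automated)
```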
Several broader lessons emerge from this scoring approach.
First, instructional quality is not reducible to textual quality. Good writing is important, but it is not the same as good teaching.
Second, comparison-heavy content deserves special treatment. It fails in ways that generic rubrics are poorly suited to detect.
Third, a recurring weakness in AI-generated lessons is not inaccuracy, but lack of decisiveness. This makes sharpness a necessary evaluation dimension, not a stylistic luxury.
Fourth, learners benefit from structure that helps them choose, distinguish, and apply. Evaluation should therefore reward instructional clarity at the level of decisions and boundaries, not only sentence-level coherence.
This approach remains incomplete in several ways.
First, some rubric dimensions still depend on interpretive judgment, even when formalized. Second, structural signals can improve evaluation, but they can also oversimplify if used too aggressively. Third, different objective types may require different teaching styles, which suggests that evaluation should eventually become more sensitive to pedagogic mode rather than relying on one generalized scoring pattern.
Current areas for improvement include better calibration against human-reviewed examples, structural signals that correct without oversimplifying, and rubrics adapted to different pedagogic modes.
These are practical research problems rather than merely implementation details.
AI-generated educational content should be evaluated by whether it teaches, not merely by whether it reads well.
That shift has real implications for scoring design. Once learner performance becomes the target, the evaluator must care about more than fluency, relevance, and correctness. It must also care about distinction, decision support, structure, and transfer. It must penalize content that is technically accurate but still instructionally weak.
In our view, the most important shift is simple to state: stop asking only whether the content is good text, and start asking whether the learner could do something better after reading it.
That is a harder standard. It is also the right one.