Knowledge · March 21, 2026
Research

How We Score AI-Generated Learning Content

A technical research note on evaluating AI-generated learning content beyond fluency and factuality. We outline a practical scoring approach for instructional quality using relevance, completeness, groundedness, clarity, actionability, sharpness, and contrast.



Abstract

Most AI-generated educational content fails in a subtle but important way: it is often correct, coherent, and still not very effective at teaching. The problem is rarely grammar, and it is not always factual accuracy. More often, the weakness lies in instructional usefulness. A learner can read a lesson, understand every sentence, and still leave without the ability to choose, explain, or apply what they have just read. This research note outlines a practical approach to scoring AI-generated learning content as teaching rather than as text. It argues that generic quality metrics such as relevance, clarity, and completeness are necessary but insufficient, and proposes additional dimensions such as sharpness and contrast to better capture instructional quality.

1. Introduction

The evaluation of AI-generated content often relies on broad quality indicators: fluency, relevance, readability, and factual consistency. These measures are useful, but they do not fully address the demands of instructional content. Educational material has a different success condition. It must not only be correct, but also help a learner understand, distinguish, and use knowledge in practice.

This distinction becomes especially important in objective-aligned learning systems, where the goal is not merely to produce plausible content, but to support performance. In that setting, a well-written explanation that does not reduce confusion or improve decision-making is still a weak output.

This note presents an applied scoring perspective shaped by that problem. The central claim is that instructional quality requires a rubric that goes beyond generic text quality and includes dimensions that reflect decisiveness, distinction, and transfer.

2. The Limits of Generic Quality Evaluation

Generic scoring systems tend to reward content that is:

  • topically aligned
  • grammatically sound
  • logically organized
  • semantically plausible

These are useful signals. However, they do not reliably answer the question that matters most in learning design: does this help the learner perform?

A lesson can be highly readable and still fail instructionally. Common failure cases include:

  • broad overview language that never clarifies what matters most
  • correct definitions that do not help the learner distinguish between similar concepts
  • recaps that restate information without improving recall or application
  • scenarios that confirm a point without creating any real judgment or decision pressure

In other words, generic quality measures tend to over-reward descriptive adequacy. Instructional quality requires something stronger: practical discriminability and transfer support.

3. Instructional Quality as a Distinct Evaluation Problem

Instructional content operates under a stricter standard than general informational text. A successful lesson should help the learner:

  • understand the core idea
  • distinguish it from related ideas
  • recognize when it applies
  • avoid likely mistakes
  • carry something usable back into work

This means that content quality in education must be evaluated in relation to learner outcomes, not only textual properties. A technically correct lesson may still be weak if it leaves the learner unable to explain the topic clearly or choose appropriately between alternatives.

For this reason, educational scoring must be sensitive not only to semantic correctness, but also to instructional shape.

4. A Practical Rubric for Learning Content

A practical scoring model for instructional content should include several dimensions. Some are familiar from broader evaluation frameworks; others arise from repeated failure patterns in teaching-oriented generation.

A useful rubric may include:

  • Relevance: how directly the content teaches the stated objective
  • Completeness: whether it covers the essential concepts, explanations, and support needed for that objective
  • Groundedness: whether its claims remain supported by the provided context and constraints
  • Clarity: whether the learner can follow the content without excess ambiguity or cognitive load
  • Actionability: whether the learner receives examples, cues, steps, or decision support that make the lesson usable
  • Sharpness: whether the lesson frames the real tension or decision clearly instead of drifting into generic overview language
  • Contrast: whether similar options, concepts, or frameworks are distinguished strongly enough for the learner to tell them apart in practice

The first five dimensions are common in many evaluation systems. The last two become necessary when the goal is not just intelligibility, but effective teaching.
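
A concrete way to picture the rubric is as weighted dimensions combined into a single score. The sketch below is a minimal illustration: the dimension names follow the list above, but the weights, the 0-to-1 scale, and the example scores are assumptions made for demonstration, not values from any particular production system.

```python
# Minimal sketch of the rubric as weighted dimensions.
# Weights and the 0-1 score range are illustrative assumptions.
RUBRIC_WEIGHTS = {
    "relevance":     0.20,
    "completeness":  0.15,
    "groundedness":  0.15,
    "clarity":       0.15,
    "actionability": 0.10,
    "sharpness":     0.15,  # decisive framing (Section 5)
    "contrast":      0.10,  # same-axis distinction (Section 6)
}

def overall_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores, each expected in [0, 1]."""
    missing = set(RUBRIC_WEIGHTS) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing rubric dimensions: {sorted(missing)}")
    return sum(RUBRIC_WEIGHTS[d] * dimension_scores[d] for d in RUBRIC_WEIGHTS)

# Example: a lesson that reads well but frames and contrasts weakly.
print(round(overall_score({
    "relevance": 0.9, "completeness": 0.85, "groundedness": 0.9,
    "clarity": 0.9, "actionability": 0.6, "sharpness": 0.35, "contrast": 0.3,
}), 2))  # 0.72
```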

5. Sharpness: Scoring Decisive Framing

Sharpness is concerned with whether the content identifies what matters most.

Many AI-generated lessons are accurate but soft. They introduce a topic in broad, competent language without clarifying the real decision, tension, or distinction the learner needs to grasp. A lesson may describe several frameworks as “helping organizations manage security and compliance” without telling the learner why one would be chosen over another.

This is not simply a style issue. It is an instructional weakness. Learners often need orientation before detail. If the opening framing is too generic, the rest of the lesson may remain technically correct but pedagogically vague.

A sharp lesson does at least three things:

  • names the real learner problem early
  • identifies the key difference or decision tension
  • keeps the explanation anchored to that tension throughout

Sharpness, then, is a measure of instructional decisiveness.
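
One way to operationalize this is a model-based check that rates the three behaviors above. The sketch below is only a sketch: the prompt wording, the scoring scale, and the call_judge callable are assumptions, and any client that maps a prompt string to a text reply could stand in for the judge.

```python
# Rough sketch of a model-based sharpness check.
# The prompt wording and `call_judge` are illustrative assumptions.
import json
from typing import Callable

SHARPNESS_PROMPT = """\
You are reviewing a lesson for instructional sharpness.
Score each criterion 0 (absent) to 2 (clearly satisfied):
1. The real learner problem is named early.
2. The key difference or decision tension is identified.
3. The explanation stays anchored to that tension throughout.

Lesson:
{lesson}

Reply with JSON: {{"problem_named": n, "tension_identified": n, "anchored": n}}
"""

def sharpness_score(lesson: str, call_judge: Callable[[str], str]) -> float:
    """Normalize the three criterion scores into a single 0-1 sharpness value."""
    reply = call_judge(SHARPNESS_PROMPT.format(lesson=lesson))
    scores = json.loads(reply)
    return sum(scores.values()) / 6.0  # three criteria, max 2 points each
```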

6. Contrast: Scoring Distinction Between Similar Options

Contrast is concerned with whether similar items are made usefully different.

This is especially important in lessons involving:

  • frameworks
  • methods
  • roles
  • categories
  • artifacts
  • decision paths

A common failure pattern in AI-generated instruction is the “isolated definition” problem: the content defines each item correctly, but not comparatively. The learner is given several valid descriptions but no strong basis for choosing between them.

Strong contrast usually requires same-axis comparison. For example:

  • what it is
  • who it is for
  • what output it produces
  • when to use it
  • when it is the wrong choice

Without this structure, the content may remain descriptive but not discriminative. That distinction matters because instructional success often depends on discrimination: the learner does not only need to know what each option is, they need to know how not to confuse the options with one another.
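
A rough structural check for this is to verify that every central option is covered on the same comparison axes. The sketch below assumes a per-option coverage map produced by some earlier extraction step; that representation, like the axis names chosen here, is an illustrative assumption rather than a real parser's output.

```python
# Sketch of a same-axis contrast check: every central option should be
# covered on every comparison axis. Axis names mirror the list above.
COMPARISON_AXES = {"what_it_is", "who_it_is_for", "output", "when_to_use", "wrong_choice_when"}

def missing_contrast_cells(coverage: dict[str, set[str]]) -> dict[str, set[str]]:
    """Return, per option, the comparison axes the lesson never addresses."""
    return {
        option: COMPARISON_AXES - covered
        for option, covered in coverage.items()
        if COMPARISON_AXES - covered
    }

# Example: one option is treated fully, the other only gets a definition.
print(missing_contrast_cells({
    "Framework A": set(COMPARISON_AXES),
    "Framework B": {"what_it_is", "who_it_is_for"},
}))
# Framework B is missing "output", "when_to_use", and "wrong_choice_when".
```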

7. A Recurrent Failure Pattern

One of the most persistent failure modes in AI-generated learning content can be described as accurate but non-decisive.

In this pattern, a lesson:

  • defines multiple concepts correctly
  • uses professional language
  • includes a comparison of some kind
  • presents a scenario and recap

At first glance, the content appears strong. Yet the instructional weakness becomes clear on review. The hero section frames the topic broadly rather than naming the learner’s real choice. The comparison is fragmented across multiple partial structures. One option is treated as primary while another remains secondary background. Boundaries such as “this is not a certification” or “do not use this when formal evidence is required” are absent or merely implied.

A generic evaluator may still score this highly. A learner, however, often finishes the lesson with the central confusion unresolved.

This is precisely the kind of mismatch that motivated the inclusion of sharpness and contrast in the scoring model.

8. Why Semantic Judgment Alone Is Insufficient

Model-based evaluation is flexible, but it tends to be forgiving. A language model can recognize that content is coherent, relevant, and plausible, then still overestimate its teaching value.

To reduce this inflation, it is useful to combine semantic evaluation with structural signals. These signals do not replace human judgment, but they help capture recurring weaknesses that purely semantic scoring often misses.

Examples of useful structural indicators include:

  • whether a comparison-heavy lesson presents one unified comparison across all central options rather than fragmented pairwise comparisons
  • whether explicit boundary language is present
  • whether the opening frames a real problem rather than a generic topic
  • whether recap content reinforces what the learner should remember rather than merely restating what was presented

The purpose of these signals is not to mechanize pedagogy, but to resist a common failure mode: over-scoring content that is well-formed but instructionally weak.
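
As a small illustration, the boundary-language indicator can start as a crude phrase check, as sketched below. The phrase list is an assumption; a usable version would be broader and would likely be combined with a model-based pass rather than relied on alone.

```python
# Deliberately crude sketch of one structural indicator: does explicit
# boundary language appear at all? The phrase list is an assumption.
import re

BOUNDARY_PHRASES = [
    r"this is not\b",
    r"do not use this when\b",
    r"wrong choice when\b",
    r"not a substitute for\b",
]

def has_boundary_language(lesson_text: str) -> bool:
    """True if the lesson contains at least one explicit boundary statement."""
    text = lesson_text.lower()
    return any(re.search(pattern, text) for pattern in BOUNDARY_PHRASES)
```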

9. Structural Signals as Corrective Pressure

The most useful structural adjustments are not broad or heavy-handed. They are narrow correctives applied to repeated weak patterns.

For example:

  • if a lesson is clearly comparison-heavy but does not present a unified comparison across central options, contrast should be penalized
  • if the hero section uses generic overview language and never frames the actual choice or tension, sharpness should be penalized
  • if central options appear without clear “wrong choice when” or “this is not...” boundaries, contrast should be reduced

These are not cosmetic issues. They directly affect whether the learner can use the content as intended.

In this sense, structural signals function as corrective pressure against semantic generosity.
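
The sketch below shows what this corrective step can look like in code. The penalty sizes and the shape of the signal flags are illustrative assumptions; the point is that the adjustments are narrow, targeted at specific dimensions, and capped so they cannot drive a score below zero.

```python
# Sketch of structural signals as corrective pressure: narrow, capped
# penalties applied to specific dimension scores. Penalty sizes and the
# structure of `signals` are illustrative assumptions.
def apply_structural_penalties(scores: dict[str, float], signals: dict[str, bool]) -> dict[str, float]:
    """Lower sharpness/contrast when known weak patterns are detected."""
    adjusted = dict(scores)
    if signals.get("comparison_heavy") and not signals.get("unified_comparison"):
        adjusted["contrast"] = max(0.0, adjusted["contrast"] - 0.3)
    if not signals.get("boundary_language"):
        adjusted["contrast"] = max(0.0, adjusted["contrast"] - 0.2)
    if signals.get("generic_hero_framing"):
        adjusted["sharpness"] = max(0.0, adjusted["sharpness"] - 0.3)
    return adjusted
```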

10. The Role of Human Review

No automated scoring system fully replaces human instructional judgment.

Human reviewers can detect qualities that are difficult to formalize completely:

  • whether the explanation feels memorable
  • whether the contrast is genuinely sharp
  • whether a scenario is too easy or too obvious
  • whether the learner’s likely confusion is actually resolved
  • whether the content would help in a real workplace setting

For that reason, scoring systems benefit from calibration against reviewed examples over time. The value of human review is not only in catching errors, but in exposing systematic optimism in the automated evaluator.
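
A simple version of that calibration check, sketched below, measures the average gap between automated and human scores over a set of jointly reviewed lessons. The paired scores in the example are invented for illustration.

```python
# Minimal sketch of calibration against reviewed examples: how much does
# the automated evaluator overshoot human reviewers on average?
from statistics import mean

def optimism_bias(automated: list[float], human: list[float]) -> float:
    """Mean signed gap between automated and human scores; positive = evaluator too generous."""
    assert len(automated) == len(human), "scores must be paired per lesson"
    return mean(a - h for a, h in zip(automated, human))

print(optimism_bias([0.82, 0.75, 0.90, 0.68], [0.70, 0.60, 0.85, 0.55]))  # ~0.11
```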

In practice, automated scoring should be treated as a disciplined first pass rather than a final verdict.

11. What This Approach Suggests

Several broader lessons emerge from this scoring approach.

First, instructional quality is not reducible to textual quality. Good writing is important, but it is not the same as good teaching.

Second, comparison-heavy content deserves special treatment. It fails in ways that generic rubrics are poorly suited to detect.

Third, a recurring weakness in AI-generated lessons is not inaccuracy, but lack of decisiveness. This makes sharpness a necessary evaluation dimension, not a stylistic luxury.

Fourth, learners benefit from structure that helps them choose, distinguish, and apply. Evaluation should therefore reward instructional clarity at the level of decisions and boundaries, not only sentence-level coherence.

12. Limitations and Ongoing Work

This approach remains incomplete in several ways.

First, some rubric dimensions still depend on interpretive judgment, even when formalized. Second, structural signals can improve evaluation, but they can also oversimplify if used too aggressively. Third, different objective types may require different teaching styles, which suggests that evaluation should eventually become more sensitive to pedagogic mode rather than relying on one generalized scoring pattern.

Current areas for improvement include:

  • tighter calibration against reviewed examples
  • stronger handling of comparison-heavy objectives
  • better evaluation of visual and artifact-based lessons
  • improved routing based on the dominant teaching style of the objective
  • more reliable detection of generic overview language that appears polished but teaches weakly

These are practical research problems rather than merely implementation details.

13. Conclusion

AI-generated educational content should be evaluated by whether it teaches, not merely by whether it reads well.

That shift has real implications for scoring design. Once learner performance becomes the target, the evaluator must care about more than fluency, relevance, and correctness. It must also care about distinction, decision support, structure, and transfer. It must penalize content that is technically accurate but still instructionally weak.

In our view, the most important shift is simple to state: stop asking only whether the content is good text, and start asking whether the learner could do something better after reading it.

That is a harder standard. It is also the right one.



Tags: AI Evaluation · Instructional Design · Learning Content · Educational AI · Content Scoring · Rubric Design · LLM Evaluation · Content Quality · Pedagogy · EdTech · Research Note · Technical Blog