Spoilt For Choice, Starved of Progress

“Students can now generate impressive products without engaging in meaningful learning.” Mehran Sahami, Tencent Chair of the Computer Science Department and Professor at Stanford’s School of Engineering, made that observation at Stanford’s AI+Education Summit earlier this year. The summit, hosted by the Stanford Institute for Human-Centered AI, framed the underlying condition as an assessment crisis.

The crisis is less about a shortage of resources than about the inability to read them. A 2019 randomized study found students in active-learning sections learned more yet rated passive lectures higher and reported feeling they had learned more—a clean demonstration that subjective experience can point directly away from actual outcomes. That divergence isn’t accidental. When learners choose tools by polish, popularity, volume, and ease of use, they’re relying on signals optimized for adoption, not for exam performance. The governing tension is between a resource’s surface presentation and its structural capacity to drive learning—and it runs through every decision about which tools to use, recommend, or trust.

The Abundance Trap

The misalignment isn’t just an individual oversight; it’s institutional. Schools distribute curriculum documents but rarely teach students or teachers how to evaluate the resources meant to deliver the objectives those documents set out. Teacher preparation programs cover content and classroom management far more than methods for assessing platform quality. A 2022 EdWeek Research Center survey of 1,343 educators found that a majority felt poorly or not at all prepared by their training programs to evaluate education technology, a preparation gap that has persisted through years of accelerating tool adoption.

The scale of what fills that gap is striking. Instructure, working with InnovateEDU, released a 2026 Evidence Report analyzing 150 classroom technologies against federally recognized research standards. It found that 40% of purpose-built education tools had identifiable Every Student Succeeds Act (ESSA)-aligned evidence, compared with just 2% of consumer technologies used in classrooms. The report is corporate-commissioned, so its figures should be read with that context in mind. But the pattern it describes—products saturating classrooms and homework routines without rigorous evidence that they improve learning—is consistent with the structural gap the survey data already suggests.

In that environment, accumulating tools can feel like progress in its own right. Students bookmark websites, subscribe to channels, download apps, and schools assemble long lists of approved platforms. The volume creates an impression of preparedness. What it obscures is the distinction between how a resource appears—clean interfaces, large libraries, enthusiastic testimonials—and how it is built. The design choices that drive exam performance sit in the architecture, not the interface. And under exam conditions, invisible architecture has visible consequences.

What Quality Actually Looks Like

The least visible dimension of any learning resource is often the one that matters most: whether it requires students to produce answers or merely recognize them. Questions that demand recall from memory train the mechanism exams actually test. Recognition-based formats build familiarity, but without the same retrieval demand. Meta-analytic reviews confirm the distinction—Rowland (2014) found that tests can function as effective learning events, and Adesope, Trevisan & Sundararajan (2017) concluded that practice tests are generally more beneficial for learning than restudying. A second dimension is feedback quality: explanations that identify which concept was misunderstood and why allow students to repair specific gaps. Bare scores mainly confirm existing competence without changing it.

Provenance asks who built the content—practicing educators with direct experience of how a qualification is assessed, or teams without that proximity to the mark scheme. That distinction shapes whether question formats, command terms, and mark allocations mirror real high-stakes exams or approximate only the topic labels. Difficulty calibration asks whether ‘easy,’ ‘medium,’ and ‘hard’ are anchored to the cognitive demands of the target exams, or to a looser sense of length and complexity. Together, provenance and calibration determine whether practice rehearses actual exam conditions or a plausible-looking substitute.

These dimensions are hard to see because most evaluators are drawn to surface signals—interface quality, output richness, the speed and confidence of responses. That challenge intensifies where AI capabilities have become a standard feature of EdTech platforms. The problem Sahami named at Stanford—that students can generate impressive products without engaging in meaningful learning—applies directly to how AI-enabled tools reshape evaluation. When a platform can produce polished, coherent outputs on demand, the structural questions about whether retrieval is required, whether feedback addresses conceptual errors, and whether difficulty is calibrated to actual exam demands become easy to skip entirely. Google’s position within the broader AI-enabled EdTech landscape illustrates where that dynamic plays out at scale. The evaluative question shifts from what a tool can produce—in a market saturated with capable AI, that bar is easily cleared—to what the design actually requires of the learner. When impressive outputs are available regardless of instructional design, the structural criteria that separate learning from its simulation become harder to apply, not easier.

The Peer Evaluation Problem

In the absence of formal frameworks, peer judgment has become the default selection mechanism. Students ask classmates what they used to revise, search forums for recommendations, and watch study influencers compare platforms. When the peers involved have recently navigated the same syllabus and can describe how closely a resource’s difficulty and style matched the papers they sat, that feedback can carry genuine weight.

Yet peer evaluation reliably struggles with the dimensions that matter most. Interface refreshes are routinely mistaken for pedagogical improvement—a confusion the EdTech industry has little incentive to correct, given that a redesign is cheaper to ship and easier to demo than better-calibrated questions. When question formats, feedback mechanisms, and content provenance remain unchanged beneath a cleaner interface, the perceived quality signal and the structural reality have simply parted ways. Marketing narratives and early network effects can then push a platform to prominence because it looks organized and feels reassuring, not because its architecture is aligned with how learning actually works.

The gap between popularity and structural quality is visible at system scale. Tools that appear most often on recommended-resource lists tend to be those that spread quickly through word of mouth or come bundled conveniently with existing systems, not necessarily those with the strongest evidence base. Peer adoption becomes a proxy for visibility rather than a measure of structural alignment—and that proxy is least reliable precisely where alignment demands are most exacting: in subjects where how questions are phrased, weighted, and marked is not a matter of convention but of the exam itself.

Reading the Design

Mathematics makes structural misalignment unusually easy to spot. An exam question either reflects how marks are allocated, how command terms are used, and where students most reliably lose marks under pressure—or it doesn’t. For students, that gap determines whether practice prepares them for actual assessment demands or for a parallel version of the subject that looks similar but tests something adjacent.

Revision Village, an online revision platform for IB Diploma and IGCSE students and teachers, shows how those structural differences become readable in design. Its core Questionbank for IB and IGCSE mathematics contains thousands of syllabus-aligned, exam-style questions that students can filter by topic and difficulty. Each question is paired with a written markscheme and a step-by-step video solution produced by experienced IB educators, including examiners and classroom teachers. That pairing matters more than it sounds: most revision resources show the right answer without ever showing how the marks are actually awarded. The combination of examiner-authored questions and explicit mark-allocation logic makes the architecture legible: a student can see not just what the correct answer is, but how marks are earned and lost and on what grounds. Exam-style formats demand full produced answers; difficulty filters make calibration explicit; and examiner involvement ties the content directly to the assessment logic of the qualification. The platform reports use by more than 350,000 IB students from over 1,500 schools in more than 135 countries, which means these design decisions operate at a scale where their structural logic carries real weight.

The point isn’t that one platform has solved mathematics revision. It’s that its architecture can be read, and read without marketing doing the interpretation. A student or teacher reviewing Revision Village who knows which structural criteria to apply can work through the Questionbank design, the markschemes, and the solution videos and reach a defensible conclusion. The design either holds up under those criteria or it doesn’t.

The Credibility Hierarchy and the Personal Framework

Review credibility comes down to proximity. A subject-focused educator who can judge whether questions match a specific syllabus and mark scheme is better placed to assess a platform than a general study-skills influencer, however large their audience. The more specific the exam—in format, command terms, and marking logic—the more that proximity matters.

The same logic applies at the institutional level; it just takes longer to formalize. Some school systems are now building evaluation rubrics that ask whether tools are safe, evidence-informed, inclusive, interoperable with existing systems, and usable in real classrooms. Reassuring, if slightly overdue given the years of tool adoption that preceded the rubrics. The shift moves institutional attention from how a product looks to whether it works, for whom, and under what conditions. Individual evaluators and formal frameworks are ultimately answering the same question from different distances.

Those questions work just as well as a personal checklist. Does practice require producing answers under exam-like conditions? Does feedback identify specific errors clearly enough to fix them? Who created the content, and do difficulty labels correspond to official papers? Applied consistently, those checks build evaluation literacy—a practiced capacity to judge tools by design rather than by appearance, in any domain where the stakes of getting it wrong eventually show up somewhere measurable.

The One Resource Worth Finding

Today’s learners face a genuine paradox: they are surrounded by more tools than any previous generation, yet they operate in a market optimized for engagement and adoption rather than for the match between preparation and exam reality. The assessment crisis identified at Stanford’s AI+Education Summit is one symptom, not of too little technology but of too little capacity to tell the productive kind from the merely convincing.

With a clearer evaluative frame, that capacity is learnable. The most resourced generation of learners can become the most precisely equipped—not by collecting more tools, but by developing the habit of asking whether what’s in front of them is designed to build skill or merely designed to resemble it. That question doesn’t expire when the exam does. The next time someone is deciding whether a professional certification course, a corporate training program, or a new platform is worth the time, they’re running the same four criteria—whether they know it or not.
