k12eval
Open evaluation infrastructure for AI in K-12.
AI is rapidly becoming part of how teachers grade, give feedback, and respond to student work. Districts are adopting tools faster than the field can evaluate them. K12Eval is building the public corpus, the validation methodology, and the reference benchmark the next decade of educational AI will need. We are starting by aggregating what already exists.
We also maintain a free index of 11 open datasets the AI-grading field is using.
The index covers LEAR Lab, ASAP 2.0, PERSUADE-adjacent corpora, and 8 more, all linked to their original sources and curated as a public service alongside the rest of the framework.
AI grading is becoming infrastructure. The data and benchmarks behind it haven't kept up.
Within five years, most U.S. classrooms will use some form of AI to grade student work or give feedback. Districts are adopting these tools faster than the field can study them. State procurement officers are asked to evaluate vendors with no shared yardstick to compare them against.
Foundational research exists. ASAP, PERSUADE, ETS corpora, and the broader automated-essay-scoring literature have shaped the field. But that work is small, dated, narrow in subject coverage, and predates the frontier LLMs now being deployed in classrooms.
There is no current, comprehensive, multi-vendor benchmark for AI grading. There is no widely adopted protocol for measuring demographic fairness in automated scoring. There is no contemporary corpus that captures how teachers actually grade: in real classrooms, with real rubrics, and with bilingual learners represented.
K12Eval is being built to update that infrastructure for the LLM era: the largest teacher-annotated K-12 corpus in existence, released under open licenses, alongside the methodology and the benchmark needed to evaluate any AI grading system against human educators.
State of the field, 2026
Honest assessment, with prior art named.
Open K-12 corpora with teacher labels
Small + dated
ASAP (~17K essays, 2012), PERSUADE (~25K, 2022), ETS corpora. Foundational, but they predate the LLM era and are narrow in scope.
Public benchmarks for AI grading
Single-purpose
Existing benchmarks (ASAP-AES Kaggle, etc.) score essays only, predate frontier LLMs, and don't cover vendors, fairness, or bilingual learners.
Standardized fairness audits
No standard
Academic work on AES fairness exists (Loukina and others), but there is no widely adopted protocol. Vendors don't disclose fairness results. Districts have no shared yardstick.
The framework
Nine open artifacts. Three categories. One public infrastructure.
K12Eval ships a stack of artifacts that together form the corpus, the methodology, and the public surface for evaluating AI grading systems. Each ships under open licenses. Status reflects what is available now versus in development.
Category 1
Data
Available · Tier 1
Production corpus
10M+ student submissions where every label is the result of a teacher reviewing, editing, and approving an AI-proposed score. Bilingual coverage across EN and ES.
Available · Tier 2
IRR-validated subsets
Statistically validated samples where AI scores have been compared against credentialed human raters on real high-stakes assessments. First study: Success Academies, 2026.
Funder-supported · Tier 3
k12eval-bench
10,000+ gold-standard items independently scored by three expert raters and reconciled to consensus. The reference benchmark for AI grading in K-12.
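How three raters are reconciled to consensus isn't spelled out above, so the following is a rough sketch only: a common approach is majority vote among the three raters, with a flagged fallback when all three disagree. The function name and the escalation rule are illustrative assumptions, not k12eval-bench's actual procedure.

```python
from statistics import median

def reconcile(scores: list[int]) -> tuple[int, bool]:
    """Fold three independent rater scores into one consensus label.

    Returns (consensus, needs_review). Illustrative sketch only; the
    actual k12eval-bench reconciliation rule may differ.
    """
    assert len(scores) == 3
    # If at least two of the three raters agree exactly, take that score.
    for s in set(scores):
        if scores.count(s) >= 2:
            return s, False
    # All three differ: fall back to the median and flag for adjudication.
    return int(median(scores)), True

print(reconcile([3, 3, 4]))  # (3, False) -- majority agreement
print(reconcile([2, 3, 4]))  # (3, True)  -- route to an expert adjudicator
```

Under a rule like this, items where all three experts disagree get routed back for adjudication rather than silently averaged away.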
Category 2
Methodology
Available
K12Eval framework
The methodology document. Defines metrics (QWK, agreement, bias, MAE), sample sizes, rater calibration, and fairness analysis for any IRR study on AI grading.
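For concreteness, a minimal Python sketch of the agreement metrics that list names: quadratic weighted kappa (QWK), agreement rates, signed bias, and mean absolute error (MAE). This illustrates the standard definitions, not the framework's reference implementation; in particular, treating bias as the signed mean difference and including a within-one-point agreement rate are assumptions here.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, mean_absolute_error

def irr_report(human: np.ndarray, ai: np.ndarray) -> dict:
    """Core metrics for one human-vs-AI comparison on integer rubric scores."""
    return {
        "qwk": cohen_kappa_score(human, ai, weights="quadratic"),
        "exact_agreement": float(np.mean(human == ai)),
        # Share of scores within one rubric point (the band the case
        # study below reports against).
        "adjacent_agreement": float(np.mean(np.abs(human - ai) <= 1)),
        "mae": mean_absolute_error(human, ai),
        # Signed mean difference: positive means the AI scores high.
        "bias": float(np.mean(ai - human)),
    }

human = np.array([3, 4, 2, 4, 3, 1, 4, 2])
ai    = np.array([3, 4, 2, 3, 3, 1, 4, 3])
print(irr_report(human, ai))
```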
In development
Fairness audit toolkit
Open-source code and methodology to measure demographic bias in any AI grading system. Built for districts to use during procurement.
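The toolkit is still in development, so the following is an assumed shape rather than its actual API: group paired scores by a demographic attribute, compute per-group agreement and signed bias, and report the widest bias gap across groups. The subgroup labels below (English learners vs. not) are hypothetical sample data.

```python
import numpy as np
from collections import defaultdict
from sklearn.metrics import cohen_kappa_score

def subgroup_audit(human, ai, groups):
    """Per-group QWK and signed bias, plus the largest bias gap.

    Hypothetical sketch of a fairness audit; not the toolkit's API.
    """
    pairs = defaultdict(list)
    for h, a, g in zip(human, ai, groups):
        pairs[g].append((h, a))
    report = {}
    for g, ha in pairs.items():
        h = np.array([x[0] for x in ha])
        a = np.array([x[1] for x in ha])
        report[g] = {
            "n": len(ha),
            "qwk": cohen_kappa_score(h, a, weights="quadratic"),
            "bias": float(np.mean(a - h)),  # positive = AI over-scores group
        }
    biases = [r["bias"] for r in report.values()]
    report["max_bias_gap"] = max(biases) - min(biases)
    return report

human  = [3, 4, 2, 4, 3, 2, 4, 3]
ai     = [3, 4, 2, 4, 4, 3, 4, 4]
groups = ["EL"] * 4 + ["non-EL"] * 4
print(subgroup_audit(human, ai, groups))
```

A district could run a sketch like this during procurement, substituting a vendor's scores for `ai`.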
In development
Datasheets
Documentation for every dataset in the Datasheets for Datasets format (Gebru et al.): collection process, sample composition, intended use, known limitations, prohibited uses.
Category 3
Public infrastructure
Available
Rubric library
Every rubric used in the framework, organized by standard (CCSS, NGSS, TEKS, state-specific). Independently useful, openly licensed.
Live (v0.1)
Public leaderboard
Every major foundation model and AI grading vendor evaluated against human raters using a single shared methodology. Updated quarterly.
Planned 2027
Annual State of AI Grading report
Summarizes leaderboard movements, vendors evaluated, methodology updates, and trends across the field.
The benchmark, methodology, and evaluator network are built with philanthropic support. The case for funding, the named-tier structure, and the application form live on the funders page.
k12eval-bench anchors the public leaderboard, where every major foundation model and every commercial AI grading vendor is evaluated on the same human-rated answer key.
Success Academies used the K12Eval framework to evaluate cograder against the NY Regents.
The framework's first deployment in the field. One of the highest-performing charter networks in the country ran a full inter-rater reliability study using the K12Eval methodology. The system evaluated was cograder, an AI grading platform serving 1M+ K-12 students. The methodology was co-designed by the study partners. The numbers are public.
Headline result: 98.1% ELA agreement (scores within one rubric point) and 0.97 math QWK against credentialed human raters.
Study conductor
Success Academies
50+ schools across NYC. 22,000+ students. Top-performing charter network in New York State for over a decade.
System evaluated
cograder
AI grading platform · 1M+ students
Quote pending, to be added with partner approval
"[A line from a Success Academies academic leader on what they tested, why they tested it, and what they learned. Roughly two sentences.]"
Disagreements are published alongside agreement: 1.9% of ELA scores and 2.1% of math scores differed from human raters by more than one rubric point. The full report, including methodology, sample composition, and per-question breakdowns, is available on request.
About K12Eval
Convening organization
K12Eval is currently operated by cograder's research team as the founding steward, with cograder funding the v0.1 release as a public benefit contribution. The roadmap for v1 includes expanding governance to an independent advisory board funded by the initiative, with seats reserved for academic researchers, district leaders, and lead funders.
Open licenses
All datasets are released under CC BY 4.0. All code, eval scripts, and the methodology framework are released under MIT. The bench, the leaderboard, and the annual report are free to use, cite, and build on. The funded artifacts remain permanently public.
Get involved
Pick the door that fits.
For researchers
Free download
Start with our public index of open K-12 essay corpora, then layer on K12Eval Tier 1 and Tier 2 datasets, the methodology, datasheets, and eval suite. MIT and CC BY 4.0.