k12eval
Open evaluation infrastructure for AI in K-12.
AI is rapidly becoming part of how teachers grade, give feedback, and respond to student work. Districts are adopting tools faster than the field can evaluate them. K12Eval is building the public corpus, the validation methodology, and the reference benchmark the next decade of educational AI will need. We are starting by aggregating what already exists.
We also maintain a free index of 11 open datasets the AI-grading field is using.
The index covers LEAR Lab, ASAP 2.0, PERSUADE-adjacent corpora, and 8 more, all linked to their original sources and curated as a public service alongside the rest of the framework.
AI grading is becoming infrastructure. The data and benchmarks behind it haven't kept up.
Within five years, most U.S. classrooms will use some form of AI to grade student work or give feedback. Districts are adopting these tools faster than the field can study them. State procurement officers are asked to evaluate vendors with no shared yardstick to compare them against.
Foundational research exists. ASAP, PERSUADE, ETS corpora, and the broader automated-essay-scoring literature have shaped the field. But that work is small, dated, narrow in subject coverage, and predates the frontier LLMs now being deployed in classrooms.
There is no current, comprehensive, multi-vendor benchmark for AI grading. There is no widely adopted protocol for measuring demographic fairness in automated scoring. There is no contemporary corpus that captures how teachers actually grade: in real classrooms, with real rubrics, and with bilingual learners represented.
K12Eval is being built to update that infrastructure for the LLM era: the largest teacher-annotated K-12 corpus in existence, released under open licenses, alongside the methodology and the benchmark needed to evaluate any AI grading system against human educators.
State of the field, 2026
Honest assessment, with prior art named.
Open K-12 corpora with teacher labels
Small + dated
ASAP (~17K essays, 2012), PERSUADE (~25K, 2022), ETS corpora. Foundational, but they predate the LLM era and are narrow in scope.
Public benchmarks for AI grading
Single-purpose
Existing benchmarks (ASAP-AES Kaggle, etc.) score essays only, predate frontier LLMs, and don't cover vendors, fairness, or bilingual learners.
Standardized fairness audits
No standard
Academic work on AES fairness exists (Loukina and others), but there is no widely adopted protocol. Vendors don't disclose fairness results. Districts have no shared yardstick.
The framework
Nine open artifacts. Three categories. One public infrastructure.
K12Eval ships a stack of artifacts that together form the corpus, the methodology, and the public surface for evaluating AI grading systems. Each ships under open licenses. Status reflects what is available now versus in development.
Category 1
Data
Available · Tier 1
Production corpus
10M+ student submissions where every label is the result of a teacher reviewing, editing, and approving an AI-proposed score. Bilingual coverage across EN and ES.
Available · Tier 2
IRR-validated subsets
Statistically validated samples where AI scores have been compared against credentialed human raters on real high-stakes assessments. First study: Success Academies, 2026.
Funder-supported · Tier 3
k12eval-bench
10,000+ gold-standard items independently scored by three expert raters and reconciled to consensus. The reference benchmark for AI grading in K-12.
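How three raters are reconciled to consensus isn't spelled out above, so the following is a rough sketch only: a common approach is majority vote among the three raters, with a flagged fallback when all three disagree. The function name and the escalation rule are illustrative assumptions, not k12eval-bench's actual procedure.

```python
from statistics import median

def reconcile(scores: list[int]) -> tuple[int, bool]:
    """Fold three independent rater scores into one consensus label.

    Returns (consensus, needs_review). Illustrative sketch only; the
    actual k12eval-bench reconciliation rule may differ.
    """
    assert len(scores) == 3
    # If at least two of the three raters agree exactly, take that score.
    for s in set(scores):
        if scores.count(s) >= 2:
            return s, False
    # All three differ: fall back to the median and flag for adjudication.
    return int(median(scores)), True

print(reconcile([3, 3, 4]))  # (3, False) -- majority agreement
print(reconcile([2, 3, 4]))  # (3, True)  -- route to an expert adjudicator
```

Under a rule like this, items where all three experts disagree get routed back for adjudication rather than silently averaged away.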
Category 2
Methodology
Available
K12Eval framework
The methodology document. Defines metrics (QWK, agreement, bias, MAE), sample sizes, rater calibration, and fairness analysis for any IRR study on AI grading.
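For concreteness, a minimal Python sketch of the agreement metrics that list names: quadratic weighted kappa (QWK), agreement rates, signed bias, and mean absolute error (MAE). This illustrates the standard definitions, not the framework's reference implementation; in particular, treating bias as the signed mean difference and including a within-one-point agreement rate are assumptions here.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, mean_absolute_error

def irr_report(human: np.ndarray, ai: np.ndarray) -> dict:
    """Core metrics for one human-vs-AI comparison on integer rubric scores."""
    return {
        "qwk": cohen_kappa_score(human, ai, weights="quadratic"),
        "exact_agreement": float(np.mean(human == ai)),
        # Share of scores within one rubric point (the band the case
        # study below reports against).
        "adjacent_agreement": float(np.mean(np.abs(human - ai) <= 1)),
        "mae": mean_absolute_error(human, ai),
        # Signed mean difference: positive means the AI scores high.
        "bias": float(np.mean(ai - human)),
    }

human = np.array([3, 4, 2, 4, 3, 1, 4, 2])
ai    = np.array([3, 4, 2, 3, 3, 1, 4, 3])
print(irr_report(human, ai))
```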
In development
Fairness audit toolkit
Open-source code and methodology to measure demographic bias in any AI grading system. Built for districts to use during procurement.
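The toolkit is still in development, so the following is an assumed shape rather than its actual API: group paired scores by a demographic attribute, compute per-group agreement and signed bias, and report the widest bias gap across groups. The subgroup labels below (English learners vs. not) are hypothetical sample data.

```python
import numpy as np
from collections import defaultdict
from sklearn.metrics import cohen_kappa_score

def subgroup_audit(human, ai, groups):
    """Per-group QWK and signed bias, plus the largest bias gap.

    Hypothetical sketch of a fairness audit; not the toolkit's API.
    """
    pairs = defaultdict(list)
    for h, a, g in zip(human, ai, groups):
        pairs[g].append((h, a))
    report = {}
    for g, ha in pairs.items():
        h = np.array([x[0] for x in ha])
        a = np.array([x[1] for x in ha])
        report[g] = {
            "n": len(ha),
            "qwk": cohen_kappa_score(h, a, weights="quadratic"),
            "bias": float(np.mean(a - h)),  # positive = AI over-scores group
        }
    biases = [r["bias"] for r in report.values()]
    report["max_bias_gap"] = max(biases) - min(biases)
    return report

human  = [3, 4, 2, 4, 3, 2, 4, 3]
ai     = [3, 4, 2, 4, 4, 3, 4, 4]
groups = ["EL"] * 4 + ["non-EL"] * 4
print(subgroup_audit(human, ai, groups))
```

A district could run a sketch like this during procurement, substituting a vendor's scores for `ai`.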
In development
Datasheets
Documentation for every dataset in the Datasheets for Datasets format (Gebru et al.): collection process, sample composition, intended use, known limitations, prohibited uses.
Category 3
Public infrastructure
Available
Rubric library
Every rubric used in the framework, organized by standard (CCSS, NGSS, TEKS, state-specific). Independently useful, openly licensed.
Live (v0.1)
Public leaderboard
Every major foundation model and AI grading vendor evaluated against human raters using a single shared methodology. Updated quarterly.
Planned 2027
Annual State of AI Grading report
Summarizes leaderboard movements, vendors evaluated, methodology updates, and trends across the field.
The benchmark, methodology, and evaluator network are built with philanthropic support. The case for funding, the named-tier structure, and the application form live on the funders page.
k12eval-bench anchors the public leaderboard, where every major foundation model and every commercial AI grading vendor is evaluated on the same human-rated answer key.
Success Academies used the K12Eval framework to evaluate cograder against the NY Regents.
The framework's first deployment in the field. One of the highest-performing charter networks in the country ran a full inter-rater reliability study using the K12Eval methodology. The system evaluated was cograder, an AI grading platform serving 1M+ K-12 students. The methodology was co-designed by the study partners. The numbers are public.
Headline result: 98.1% ELA agreement (scores within one rubric point) and 0.97 math QWK against credentialed human raters.
Study conductor
Success Academies
50+ schools across NYC. 22,000+ students. Top-performing charter network in New York State for over a decade.
System evaluated
cograder
AI grading platform · 1M+ students
Quote pending, to be added with partner approval
"[A line from a Success Academies academic leader on what they tested, why they tested it, and what they learned. Roughly two sentences.]"
Disagreements are published alongside agreement: 1.9% of ELA scores and 2.1% of math scores differed from human raters by more than one rubric point. The full report, including methodology, sample composition, and per-question breakdowns, is available on request.
About K12Eval
Convening organization
K12Eval is currently operated by cograder's research team as the founding steward, with cograder funding the v0.1 release as a public benefit contribution. The roadmap for v1 includes expanding governance to an independent advisory board funded by the initiative, with seats reserved for academic researchers, district leaders, and lead funders.
Open licenses
All datasets are released under CC BY 4.0. All code, eval scripts, and the methodology framework are released under MIT. The bench, the leaderboard, and the annual report are free to use, cite, and build on. The funded artifacts remain permanently public.
Get involved
Pick the door that fits.
For researchers
Free download
Start with our public index of open K-12 essay corpora, then layer on K12Eval Tier 1 and Tier 2 datasets, the methodology, datasheets, and eval suite. MIT and CC BY 4.0.