I wrote recently about the idea of accountability—why it matters and how we do it wrong, especially in education. In many school districts, the office in charge of testing is called “assessment and accountability,” though it’s never quite clear who is being held accountable, and to what.
One thing I’ve learned about testing is that the closer it happens to instruction, in both time and space, the more useful the information will be. A classroom quiz at the end of the week is more instructionally useful to teachers than a final exam, and a classroom final exam aligned to the teacher’s day-to-day curriculum says more about what students learned than a state-wide assessment of large and sometimes vague “standards.” But our recent history hasn’t shown a lot of trust in teachers to do their jobs behind closed doors.
Being Formative
As I said, a quiz at the end of the week is more instructionally useful than a final exam: if you see bad results on a Friday, you can take action on Monday to remediate the problem…unless you’re tied to a pacing calendar that allows you no room to breathe and no flexibility to change plans.
Those kinds of assessments are called “formative,” because their focus is meant to be on the formation of skills and knowledge. Formative assessments can be as formal as a written test or as informal as “raise your hand if you think the answer is five.” The key to any of these assessments, from quick check-ins to quarterly benchmark tests, is that teachers can take action in response to them, as W. James Popham lays out elegantly in his book, Transformative Assessment.
It’s not just that teachers should have time to take action; it’s also that they should have a plan to do so—a plan they’ve worked out ahead of time. I’m going to check for understanding at the end of class today, and here are three things I might do tomorrow, depending on what I find out. Or even: I’m going to pause here, mid-class, to make sure everyone gets what I’m saying, and I have a plan in mind for what I’m going to do immediately thereafter, depending on what I learn.
When I first read Popham’s book, I was embarrassed to admit that I had never really thought about the importance of that planned follow-up. The follow-up actually matters more than the question: after all, what’s the point in taking someone’s temperature if you’re not going to do anything when you find out your child has a fever?
Far too often, we ask these check-in questions and then plow ahead with the lesson as planned, regardless of what we learn. Oh, we might correct a mistake when we hear one, but we change what we were going to do next because of that mistake far less often than we should.
Standardization
One problem with formative assessment is that it often gets swallowed up in the accountability mania. It’s not enough for a teacher to assess her class, her way, on her content. She must use district-mandated tests based on a district-mandated curriculum, for review by district-level managers. Students must receive a grade. Everything formative becomes summative, with a focus on judgment and consequences, not learning.
I was involved in such a mandate when I worked for a company that was creating standardized curriculum and assessments for a large, urban school district in the early days of No Child Left Behind. There was a lot that was good about the project and a lot that was fraught and stupid, but one of the worst parts was designing the assessments, because the teachers pitched an absolute fit about being held accountable so visibly and transparently, at a district level, for teaching particular content. They understood that one purpose of those assessments was going to be to judge them, and they didn’t like it. So, they harassed district leadership until they got the superintendent to back down from his original idea. In the end, we were allowed to assess only the larger skills being taught, not any particular content in the curriculum. In Social Studies, for example, we could assess map-reading skills in general, but we couldn’t assess whether kids in American History classes knew the fifty states or how to find the Missouri River, both of which were part of the curriculum.
The further away from the classroom the testing is conceived, the more standardized it becomes, to make local variance less important and to generate comparative results across large populations. The more tests there are to score, the longer it takes to return results. No Child Left Behind gave us a ton of end-of-year state tests whose results didn’t show up until the following year, when they were useful to no one except bureaucrats who wanted to scold schools for underperformance.
Some states tried to do a good job with these tests and assess student learning as well as they could; some didn’t try very hard. I remember, in the early days, when New Jersey published a sample high school exit exam in English. It had a lot of interesting components, including narrative writing based on a piece of visual art and some kind of timed live presentation. Not everything survived early review; many of the more interesting elements proved too subjective to score and took too long to review. Efficiency and cost savings ended up being more important (which is why many states never even bothered trying anything except multiple-choice questions).
If you want valid, reliable, standardized scoring, you need a standardized implementation across the entire population, and that can make it difficult to do things like “gather resources and prepare a 15-minute presentation.” In New York, where the high school exit exam in English required a speech to be read aloud to students, one school was so worried about the variance in delivery across its faculty that it had the principal read the selection over the loudspeaker. If you’ve ever been in a big, old school building, you know just how crappy the sound quality can be on those loudspeakers. But hey, at least it was equally crappy for everyone.
Because “education reform” is more often about pendulum swings than it is about forward motion, the era of high-stakes exit exams may be coming to an end. Only six states still require them, and two of the more thoughtful and rigorous tests, in Massachusetts and New York, are being phased out, to be replaced by something theoretically more “authentic.”
“I wish them well,” as Princess Saralinda is doomed to say to every suitor who sets out on a dangerous quest to win her hand, in my favorite childhood book, The Thirteen Clocks.
What Might Better Look Like?
When people talk about “authentic” assessment, they tend to mean that they’re looking for ways to assess student knowledge and skills in contexts that more closely resemble the way we adults use them in everyday life—moving away from these artificial things we call tests and moving toward some kind of real-world performance or demonstration.
Ted Sizer, who wrote the influential books Horace’s Compromise and Horace’s School, talked about finding ways to demonstrate mastery by what he called exhibition. As one of his colleagues wrote, in an edition of the newsletter Sizer’s Coalition of Essential Schools used to put out:
Ted Sizer reached all the way back to the eighteenth century in search of an assessment mechanism that might function in this way. He found at least the possibility of it in a ubiquitous feature of the early American academies and of the common schools that shared their era. The exhibition, as practiced then, was an occasion of public inspection when some substantial portion of a school’s constituency might show up to hear students recite, declaim, or otherwise perform. The constituency might thereby satisfy itself that the year’s public funds or tuitions had been well spent, and that some cohort of young scholars was now ready to move on or out.
Joe McDonald in Horace newsletter, Winter 2007, Vol. 23, No. 1
Sizer outlined a number of possible small-scale “exhibitions” throughout his two books—things like asking middle school students to prepare a tax return for a family of four as a math assessment. But when it came to more summative and comprehensive assessments, his coalition of schools developed something bigger.
The “Rite of Passage Experience,” or ROPE program at Walden III, which has been in place for over a decade, is a fully developed model of how such a requirement can function. Born from the Australian “walkabout” tradition in which a youngster must meet certain challenges to attain adulthood, ROPE is expressly designed “to evaluate students’ readiness for life beyond high school.”
Horace newsletter, March 1990, Vol. 6, No. 3
In the same newsletter, researcher Grant Wiggins, who went on to create the Understanding by Design framework with Jay McTighe, laid out a set of criteria for these larger-scale exhibitions.

I’ve seen bits and pieces of this kind of assessment, but never all of those criteria at once, in something systematic and comprehensive. Maybe you have.
When I was teaching in Atlanta, in a very alternative school, we tried an assessment like this at the end of a pilot of an interdisciplinary unit on the theme of “utopias.” Students had studied both the Peloponnesian War and the Cold War, had read sections of Plato’s Republic and 1984, and had read short stories, poems, and bits of philosophy. They had responded to each item through quizzes and essays and such, but we wanted to do some kind of final assessment. So, we threw it open to the students and asked them to represent their idea of “utopia” in any form they chose. It was very loose and disorganized; we didn’t even have a scoring rubric. But the results were fascinating. Each student had to present their project and explain why and how it represented their thinking. One student wrote a song; another created a sculpture; one just did a guided meditation for the class. It was an interesting experiment. Where the school took that work after I left, I don’t know.
This kind of project-based learning and assessment can be fun and informative and engaging for students. It can also be garbage. Crafting the right project and figuring out what criteria you’re going to use to evaluate it are both very tricky. In practice, I’ve seen teachers grading the quality of an art project rather than its demonstration of academic skills or content knowledge. “I really enjoyed the way you edited your video” is lovely feedback, but if the video is meant to replace, say, a history test, the feedback needs more academic teeth than that.
When I was teaching conversational English in Slovakia, I saw another element of the exhibition idea—actually, something more in line with the idea quoted above: “The constituency might thereby satisfy itself that the year’s public funds or tuitions had been well spent, and that some cohort of young scholars was now ready to move on or out.” The high school students didn’t have any exit exams, but there was a formal assessment period at the end of the year, when classes were suspended and members of the community were invited to join school leaders to sit behind a table and observe as teachers posed questions or gave performance tasks to students, to see if they could talk the talk and walk the walk in public, as educated adults. It was a genuine rite of passage for young people in the community, and everyone took it seriously. Kids came to school in formal attire; parents brought tons of food (and alcohol) to sustain the judges all day. It was both festive and serious, and it made it clear that school mattered. I’ve never seen anything else like it.
Can Authentic Also Be Efficient?
If schools have resisted large-scale, authentic assessment in the past because of the burden it places on teachers, computer technology has been advancing in ways that may help to ease that burden.
In the first and most obvious case, machine scoring can now handle a great deal more than old-fashioned multiple-choice questions. There is a wide range of what we call “technology-enhanced” items: fill-in-the-blank questions, drag-and-drop or matching tasks, plotting lines on a graph, measuring angles. All of them can be scored instantly by the computer, and all of them are a bit more performance-oriented than traditional questions.
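To make that concrete, here’s a minimal sketch of how a matching task might be scored the instant a student submits it. The item format and field names below are my own invention for illustration, not any particular vendor’s schema; real assessment platforms each define their own.

```python
# A minimal, hypothetical auto-scorer for a "matching" item.
# The item structure is invented for illustration; real
# assessment platforms each define their own schemas.

item = {
    "prompt": "Match each state to its capital.",
    "answer_key": {
        "Georgia": "Atlanta",
        "New York": "Albany",
        "Missouri": "Jefferson City",
    },
}

def score_matching(item: dict, response: dict) -> float:
    """Return the fraction of pairs the student matched correctly."""
    key = item["answer_key"]
    correct = sum(1 for state, capital in key.items()
                  if response.get(state) == capital)
    return correct / len(key)

# A student's submitted pairs, scored the moment they click "submit".
student_response = {
    "Georgia": "Atlanta",
    "New York": "Albany",
    "Missouri": "St. Louis",  # wrong: a big city, but not the capital
}
print(score_matching(item, student_response))  # 0.666...
```

The scoring is nothing more than an exact-match lookup against a key, which is why results can be instant, and also why such an item can never reward an unexpected but defensible answer.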
But most of these tasks are built with a single correct answer in mind, which keeps them from feeling truly real-world. When we have to use our math or language skills as adults, or call up our history and science knowledge, the problems are often ambiguous and unpredictable, and the solutions may vary, depending on how we approach them and what information or questions or assumptions we bring to them. There’s a big difference between a closed question the computer can check against a key and an open-ended task that asks students to bring their own information, assumptions, and reasoning to the problem.

What’s the challenge in asking a question of that second kind? Every child will provide different information and will therefore have a different response. The teacher will not be able to grade the child’s performance unless she reviews the child’s thought process. It. Takes. Time.
Will generative AI be able to help? Maybe. Many companies are racing to provide schools and EdTech companies with tools that can analyze student writing (or speech) and provide clear, focused feedback—and even a rubric-based score. They’re a little hit-or-miss at the moment, given AI’s habit of…um…making stuff up (or “hallucinating”). But things are moving quickly in that world, and it will be interesting to see whether (or when) the tools will improve enough to be reliable without human input or interaction. Maybe the future won’t be completely hands-off for teachers, but will, instead, involve a partnership between AI and the teacher—not removing the burdens and challenges of authentic assessment entirely, but easing them and making them more manageable.
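If I were to sketch what that partnership might look like, it might be little more than sending the rubric and the student’s writing to a model, then routing anything the model isn’t confident about to the teacher. Everything below is hypothetical: request_rubric_score stands in for whatever real model API a vendor would call, the placeholder logic exists only so the sketch runs, and the rubric is a toy.

```python
# A hypothetical sketch of AI-assisted rubric scoring with a human
# in the loop. request_rubric_score stands in for a call to some
# real language-model API; nothing here names an actual service.

RUBRIC = {
    "thesis": "States a clear, arguable claim.",
    "evidence": "Supports the claim with specific, relevant detail.",
    "organization": "Ideas follow a logical order with transitions.",
}

def request_rubric_score(essay: str, rubric: dict) -> dict:
    """Stand-in for a model call. A real implementation would send
    the essay and rubric to a language model and parse its reply."""
    # Placeholder logic so the sketch runs: a crude length heuristic.
    score = min(4, max(1, len(essay.split()) // 50))
    return {name: {"score": score, "confidence": 0.55} for name in rubric}

def review_queue(essays: list[str], threshold: float = 0.8) -> list[int]:
    """Return the indexes of essays a teacher should score by hand
    because the model's confidence fell below the threshold."""
    needs_human = []
    for i, essay in enumerate(essays):
        results = request_rubric_score(essay, RUBRIC)
        if any(r["confidence"] < threshold for r in results.values()):
            needs_human.append(i)
    return needs_human

# With the placeholder's low confidence, every essay goes to the teacher.
print(review_queue(["We should lengthen the school day because..."]))  # [0]
```

The design choice that matters is the confidence threshold: the machine drafts a score for everything, but the teacher, not the model, remains the authority on anything the model isn’t sure about.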
I would love to see more of what I saw in my small, Slovak town: a world in which the relationships between school and community, student and adult, are strong and clear and meaningful—where the whole town can feel like they have a stake in the skills and abilities of their young people, and can celebrate their entrance into adulthood and citizenship.
A guy can dream.
Thanks for this thoughtful analysis. I agree with much of what you have outlined. I have a few cautions about AI:
1. I also wonder about the implications for language and culture. I'm not sure it's possible, or even wise, to have a machine evaluating the family, tribal, and neighborhood contexts that show up in a student's work. If we want authenticity, then we will get exhibitions that are not about mainstream culture.
2. We don't talk enough about the flattening effect of AI. Lessons from the internet, where everything moves to the mean, should make us cautious.
3. Finally, in some communities the information that young people might share is private, personal, and only meant for a caring adult. It probably doesn't belong in a large language model.
Especially loved the point about the uselessness of taking one's temperature if a fever won't result in any further actions.
I imagine that when I'm on my deathbed and beyond help, some doctors will be giving me tests all the time. I'll ask why. If they're honest, they'll say, "So we can monitor the progress of your decline."