Monday, February 1, 2010

In which standardized testing is mostly nonsense

This post started as part of an email conversation about standardized testing, and in particular the problem of using standardized tests to measure the "value added" by teachers and schools. That measurement system, besides being a key factor in allocating federal No Child Left Behind and Race to the Top funds, is central to efforts to give teachers financial incentives for good teaching.

Most objections to using standardized tests to measure student academic achievement or, indirectly, teacher or school quality hinge on whether these tests are actually good measures -- whether students are really rewarded for the amount of knowledge they've acquired, whether we weight questions in proportion to the quantity of knowledge they represent, and so on. But I think the problem goes deeper than this.

Because knowledge isn't directly measurable, it's hard to say what "quantity of knowledge" even means -- the number of propositions embodied in the answer to a question? The number of logically independent propositions? The number of "atomic" propositions? All of those are hard to measure. "Amount of time it takes a person to learn something" seems like the easiest one, but then you have to specify under what circumstances, and there's a definite arbitrariness there, too.

The point is, given that "quantity of knowledge" isn't well-defined, when people weight tests (assuming they're consciously thinking about questions like these), I think they tend to weight them based on one of two things:
  1. Try to make the test predict success in some future endeavor -- usually a future course or category of courses (e.g., "Algebra 2" or "college").
  2. Try to make the test measure effort or success in a previous class -- for example, by weighting questions according to how much time teachers tend to spend on each topic.
Usually, I think, test weighting comes from a combination of these two through a more circuitous route. Standards boards meet and come up with a list of things students ought to know by grade X (which, since there's no absolute sense in which anyone needs to know anything by a particular age, must be based on a combination of what people tend to learn by grade X and what they need to know to succeed in a typical course of study in grade X+1). They implicitly weight these expectations by classifying them in a hierarchy: "standards" are assumed to be roughly equally weighted, as are "benchmarks" within each standard, "performance descriptors" within each benchmark, and so on.
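
To see how that implicit weighting plays out, here's a minimal sketch in Python. The standards and benchmarks are invented for illustration; real standards documents are larger and deeper, but the arithmetic is the same:

```python
# Equal weights at each level of an invented hierarchy. Nothing here is a
# real standards document; the point is only how the weights fall out.

standards = {
    "Number Sense": ["place value", "fractions", "estimation"],
    "Geometry": ["area and perimeter"],
}

# Equal weight per standard, then equal weight per benchmark within it:
weights = {
    benchmark: 1 / len(standards) / len(benchmarks)
    for benchmarks in standards.values()
    for benchmark in benchmarks
}

print(weights)
# (rounded) {'place value': 0.167, 'fractions': 0.167, 'estimation': 0.167,
#            'area and perimeter': 0.5}
```

A benchmark that happens to share its standard with two siblings counts a third as much as one that stands alone -- a weighting nobody chose deliberately.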

But, of course, there's no well-defined, absolute quality that these tests measure -- at best, they measure a specific set of historically contingent skills, weighted in a specific, historically contingent way. And you can tell these aren't absolute by looking at the variation between different state tests. In Illinois, for example, the ISAT was "re-standardized" between 2005 and 2006. Pass rates shot up by 8% statewide, and while only 32% of 8th graders in Chicago passed the math portion in 2005 (I couldn't find statewide data broken down by subject and grade level), 65% passed in 2006.

Clearly, the two versions aren't testing the same thing in the same way. Asking which is the better measure, though, assumes we already agree on what they're supposed to be measuring. At minimum, they're weighting different types of questions differently -- in particular, the new test seems to weight easy questions more heavily than the old one did.

Fundamentally, I think what happened is that when we started referring to "standards-referenced" tests as distinct from "norm-referenced" tests, we implicitly convinced ourselves of the lie that those standards -- and the way they're weighted -- aren't themselves based on norms. Because what does "third-grade knowledge" mean, other than knowledge that successful third-graders tend to have?

This also has to change the way we think about value added, and therefore, about teacher rewards. The fundamental problem here is that it's not obvious that, in general, equal test gains represent equal teacher quality. Under equally good teaching, will students at all points on the spectrum progress equally (in other words, is it the same amount of work to bring a student from a 30 to a 60 as it is to get them from a 60 to a 90)? If so, then those gains should reward teachers the same amount. But if not -- if, for example, getting someone from a 30 to a 40 takes as much work as getting someone else from a 60 to a 90 -- we should be weighting gains accordingly, so we're not paying teachers more for teaching smart kids.
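
To make the weighting idea concrete, here's a toy sketch in Python. The difficulty curve is a pure assumption -- the whole point is that nobody knows its real shape -- and it's rigged so that the 30-to-40 and 60-to-90 gains in the example above cost equal effort:

```python
# Effort-weighted gains under a made-up difficulty curve.

def difficulty(score):
    """Hypothetical marginal effort to raise a student one point at a
    given score. Assumes the low end is three times harder; this is an
    illustration, not an empirical claim."""
    return 3.0 if score < 50 else 1.0

def weighted_gain(start, end):
    """Total effort implied by moving a student from start to end."""
    return sum(difficulty(s) for s in range(start, end))

# Under this curve, both gains represent the same amount of teaching:
print(weighted_gain(30, 40))  # 30.0
print(weighted_gain(60, 90))  # 30.0
```

Under a flat curve, by contrast, the 60-to-90 gain would be worth three times as much -- which is exactly the choice that determines whether we pay teachers more for teaching smart kids.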

So in order to talk about how to translate test gains into teaching quality, we need to know how well students at different levels do under equally good teaching. But it's tough to talk about "equally good teaching," too -- again, we're assuming there's some absolute measure of teaching quality.

One thing we might do is look at specific teachers who teach students at different levels -- is there a correlation between the improvement of their high-achieving students and the improvement of their low-achieving students? How strong is that correlation? Suppose it turns out that, in general, a teacher who tends to bring students' grades from 60 to 90 also tends to bring other students' grades from 30 to 50. Does that mean we should weight those two improvements the same way when judging other teachers' performance? Or do teachers who teach both good students and bad students tend to prioritize their good students? If they do, we're excusing bad teaching of bad students, and blaming their smaller gains on the students rather than their shitty teachers.
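
Here's a rough sketch of what that comparison might look like, on invented data. It assumes each teacher has (pre, post) scores for a mix of low- and high-scoring students; real value-added models are much more elaborate, and statistics.correlation needs Python 3.10+:

```python
from statistics import correlation, mean

# teacher -> (pre, post) score pairs; all numbers are made up
teachers = {
    "A": [(30, 48), (35, 50), (60, 88), (65, 92)],
    "B": [(28, 40), (33, 46), (58, 79), (62, 80)],
    "C": [(31, 55), (36, 58), (61, 95), (66, 97)],
}

low_gains, high_gains = [], []
for scores in teachers.values():
    low_gains.append(mean(post - pre for pre, post in scores if pre < 50))
    high_gains.append(mean(post - pre for pre, post in scores if pre >= 50))

# A strong correlation suggests gains at the two ends of the spectrum can
# be put on a common scale; a weak one suggests they can't.
print(correlation(low_gains, high_gains))
```

Even a strong correlation wouldn't settle the prioritization question -- it would only tell us how gains at the two ends tend to move together under whatever teaching students actually get.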

I think what I'm saying is: when you try to create tests to measure things that can't be measured directly, you have no way of assessing test validity. The problem is deeper than "these tests aren't valid" -- it's that we don't know what it would even mean for these tests to be valid.
