
Bruno: Actually, Statisticians Are Cautiously Optimistic About VAM

It's always nice when experts come together to help articulate and clarify whatever scientific consensus exists around an issue, so I was glad to see the American Statistical Association put out a report last week on the promise and peril of value-added modeling of educational effectiveness.

Interestingly, however, if you were to hear about this report only from the staunchest, most ideological opponents of VAM, you would think it says something else entirely. Valerie Strauss, for instance, claims the report "slammed" the use of VAM to evaluate teachers, and Diane Ravitch seems to think it is a "damning indictment" of such policies.

The report itself is not nearly so hyperbolic.

For a useful summary, check out Stephen Sawchuk, but the report itself is a mere seven accessible pages, so I encourage you to read it yourself.

The bottom line for the ASA is that they are optimistic about the use of "statistical methodology" to improve and evaluate educational interventions, but current value-added models have many limitations that make them difficult to interpret and apply, especially when evaluating individual teachers.
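To make that concrete: at its core, a value-added model is a regression that predicts students' current test scores from their prior scores and attributes each classroom's leftover average to the teacher. Here's a minimal sketch on synthetic data - emphatically not the ASA's specification or any district's actual model, just a toy illustration of the basic idea and of why individual teacher estimates can be hard to interpret.

```python
# A toy value-added model (VAM) on synthetic data. This is a sketch for
# illustration only -- not the ASA's specification or any district's
# actual model. Core idea: regress current test scores on prior scores
# plus teacher dummies, and read each teacher's fixed effect as their
# "value added."
import numpy as np

rng = np.random.default_rng(0)
n_teachers, class_size = 20, 25

teacher = np.repeat(np.arange(n_teachers), class_size)
prior = rng.normal(0, 1, teacher.size)              # last year's (standardized) score
true_effect = rng.normal(0, 0.1, n_teachers)        # true teacher effects, kept small
noise = rng.normal(0, 1, teacher.size)              # everything the model can't see
score = 0.7 * prior + true_effect[teacher] + noise  # this year's score

dummies = (teacher[:, None] == np.arange(n_teachers)).astype(float)

def teacher_effects(controls):
    """OLS teacher fixed effects, given a list of control columns."""
    X = np.column_stack(controls + [dummies])
    beta, *_ = np.linalg.lstsq(X, score, rcond=None)
    return beta[len(controls):]

# Model A controls for prior achievement; Model B (naively) does not.
est_a = teacher_effects([prior])
est_b = teacher_effects([])

# 1) Estimates track the truth only loosely when true teacher effects
#    are small relative to student-level noise.
print("corr(truth, model A):", np.corrcoef(true_effect, est_a)[0, 1])

# 2) Teacher rankings move when the specification changes.
rank_a = est_a.argsort().argsort()
rank_b = est_b.argsort().argsort()
print("rank agreement (A vs. B):", np.corrcoef(rank_a, rank_b)[0, 1])
```

With teacher effects deliberately dwarfed by student-level noise - in the spirit of the report's observation that teachers account for a modest share of score variability - the estimates correlate with the truth only loosely, and the rankings shift when the model changes. That, in miniature, is the tension the report describes.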

If you read the report yourself - and don't just scan for the bits you find convenient - you will see that the authors note at various points that value-added scores correlate with future student outcomes (academic and otherwise) and with teachers' own scores in future years. They also discuss ways to improve the reliability of value-added models, and are especially optimistic about using VAMs to help students and schools monitor student progress.

The report does, of course, emphasize that caution is needed when interpreting and using VAM data, but even here it is important to remember that all methods of evaluation have limitations. That reformers often do not understand those limitations is no excuse for pretending the alternatives are not limited in their own ways.

None of which is to say this report is a major vindication for reformers. Make no mistake: the report throws a great deal of cold water on the sorts of VAM proposals prominent reformers are associated with - proposals often rooted in deep misunderstandings of statistics and labor markets and promoted with over-the-top, unrealistic rhetoric.

Nevertheless, it requires a considerable degree of cherry-picking and motivated reasoning to suggest that this report - which explicitly refuses to condemn any particular policy - is a blanket rejection of value-added modeling.

On the contrary: the advice in this report is entirely consistent with the frequent use of VAM in education, even in some cases as a component of individual teacher evaluation. Such accountability regimes may or may not be optimal, but they are by no means ruled out by the authors' conclusions.

In their over-eagerness to give reformers a black eye, reform critics are obscuring considerable nuance. That might be politically savvy, but it's probably not ideal for improving education policy. - PB (@MrPABruno)

Comments


These quotes from the summary seem pretty unarguably damning, Paul, given that teachers' careers and lives are being ripped apart by the devotion to VAM.


"VAMs typically measure correlation, not causation:
Effects – positive or negative –
attributed to a teacher may actually be caused by o
ther factors that are not captured in
the model."

"Under some conditions, VAM scores and rankings can
change substantially when a
different model or test is used."

"Most VAM studies find
that teachers account for about 1% to 14% of the va
riability in test scores, and that the
majority of opportunities for quality improvement a
re found in the system-level
conditions. Ranking teachers by their VAM scores ca
n have unintended consequences
that reduce quality."

As I said, because VAM - like any potential evaluation method - has limitations, it is possible to misuse it. But, again, this report also explicitly refuses to make any specific condemnations. Clearly, many readers are eager to make sweeping (and to varying degrees justified) inferences from this report about their least-favorite specific policies, but those are their own condemnations, not condemnations made by the report.

If the 'anti-reform' movement is adopting the positions laid out in this report - including its points about the substantial and promising potential uses of VAM - I would consider that good news.

Hey Paul,

I'm glad you wrote about the nuance of the ASA piece, and I agree with you that opponents of VAM have seriously misinterpreted these results. While I appreciate your note that "this report throws a great deal of cold water on the sorts of VAM proposals that prominent reformers are often associated with," I think it's important to recognize that the report suggests that Rhee-type reformers are considerably more wrong about VAM than Ravitch. The report very clearly indicates that nearly every current use of VAM is highly problematic, and as a result, I don't find this sentence particularly accurate: "the advice in this report is entirely consistent with the frequent use of VAM in education, even in some cases as a component of individual teacher evaluation."

Diane Ravitch's piece that you linked is pretty accurate, actually, and while I agree with you that her supporters often incorrectly lambast the method when they should be lambasting its common usage and interpretations, the ASA's statement (and all the research on VAM that I've seen) is a pretty strong indictment of typical reform packages. Would it be wrong to call the report a "damning indictment" of the concept of VAM? Absolutely, and critics of the typical reform agenda need to recognize the value of further research about and consideration of VAM. I just think we need to be wary about suggesting that the two "sides" of this debate are equally misguided.

Ben

@Ben - There's obviously a case to be made against 'typical' reformy implementations of VAM based on the logic of this report. Sadly, Ravitch's "side" - including Ravitch herself - frequently blurs the line between reasonable and unreasonable critiques (how many times has Ravitch herself referred to VAM per se as 'junk science'?), so it's a bit hard to judge how misguided they really are.

If that "side" of the debate wants to come out explicitly as believing that there are potentially many reasonable uses of VAM in education, I'd love to hear that but it's not my sense that's where they want to take the discussion.

The report may not say outright that VAM shouldn't be used, and perhaps some cheerleaders are overly eager - but do you think any existing use of VAM in evaluations comes remotely close to honoring the caveats the report includes? They also don't address what seems another obvious problem to me - the non-comparability of conditions in a school from year to year.

@David - I think you can make a case that some VAM implementations - e.g., IMPACT - are consistent with the information in this report. Other implementations may not be, of course. But many non-VAM evaluation systems currently in place also do not avoid the limitations mentioned in this report (nor other limitations associated with, e.g., classroom observations).

Wow, Paul - you're quick on the comment replies! I was just going to add a clarification: I don't mean year-to-year comparisons are impossible, but rather that no one has proposed a method for making them. If you don't control for changes at the school, how can you compare the teacher's performance from year to year? For further discussion on that can of worms, see: http://accomplishedcaliforniateachers.wordpress.com/2010/09/02/open-letter-ca-public-officials/

And is IMPACT really doing what's recommended here? (Not a rhetorical question - I don't know). Do they provide the margin of error? Do they account for year-to-year changes in the assessment and indicate what effect the changes might have on measurements?

Yes, it's true that no method is perfect. The point I would make is that some imperfect methods are low-cost, easy to repeat, and should lead to wide-ranging conversations about all the things we value in teaching and learning. Meanwhile, VAM is expensive, not easy to repeat, and leads to something rather opaque that, at its best, would lead to narrow discussions about some of the least valuable information about students. I'm perplexed about why any teacher, given the current or recent standardized measures in education, would want to even entertain the possibility of VAM in evaluation. Besides the fact that I personally find it distasteful and insulting, I have a hell of a lot more studies and policy statements backing my point of view - from the fields of education, statistics, mathematics, psychology, and educational measurement and research. The best data supporting VAM in evaluation seems to come with way too many caveats, and mainly from a small subset of economists studying the matter. (Even the MET study had to manipulate typical school behaviors to ensure randomized classes, and then they had to admit they didn't actually achieve the level of randomization they intended).

One other point regarding the validity issue: James Popham has laid out quite clearly some basic principles about assessments needing to be instructionally sensitive if they are to be used validly. To draw valid inferences from a measure, we must, according to Popham, allow teachers to know what's on the test, and ensure that student performance on the test comes from teaching. That's just not the case with standardized tests as they've been designed and used to this point.

Should we keep an open mind about the future? I suppose so. I might even allow that standardized test results could be a small part of a conversation about teaching and learning, but again, correlation is not causation. There may be many ways to explain the test scores beyond attributing results to individual teachers. (The interdisciplinary approach of CCSS should make VAM even more complicated and murky if SB and PARCC assessments are used as a measure of one teacher's effectiveness). I would object to assigning a certain percentage to scores - I don't even like the idea of percentages assigned for any aspect of teaching - it's inherently arbitrary and counterproductive to move non-quantifiable practices into numerical forms. (I'd say the same is true for students' grades, but I do what I have to do for now).

There are a lot of claims there, not all of which I think are well-established empirically. (For example, is it really the case that VAM is substantially more expensive than classroom observations?)

If you have studies that compare the statistical validity/reliability/'repeatability' of VAM to other methods of evaluation, I'd be interested in seeing them. I'm not sure I've ever heard of a limitation of VAM that doesn't apply, to one degree or another, to (e.g.,) classroom observations, so the relevant questions are how the methods compare along a variety of dimensions, and whether multiple measures would be better - including fairer to teachers - than any single measure alone.

I don't find VAM distasteful or insulting, for the record.

Re: expenses, you're right that it's not as if observations are cheap, so let me clarify. Whatever the cost of evaluators might be, it's typically a fixed cost for a given year. If you send them to any given classroom two times, or three, or four, they're paid the same amount. If you think there's a problem with VAM, you can't just give the test again and run the calculations again. That's the least of the problems, though.

I think you're dodging my main point though - VAM is not one of several slightly flawed methods; the accumulated problems with VAM make it the worst option, and significantly counterproductive. Of course no method is perfect, and certainly, multiple measures should be used. As Ben Spielberg mentioned on Twitter the other night, probably multiple evaluators and observations too.

Meanwhile, VAM suffers from issues of instability, invalidity, lack of transparency, and lack of instructional sensitivity. I'd be curious to know why you think a measure that fails to account for so many variables and has so many incomprehensible assumptions is still useful - at all. The studies that argue in its favor seem to rely on multiple years' worth of data. Do you have multiple years to wait for a minimal amount of marginally useful data from one test? Even if so, VAM gets us focused on the weakest types of assessments we have for a rather narrow subset of student skills - and we can't even review the assessment itself. Will SB or PARCC be better? Years from now, we might be able to measure that.

I never suggested that you, or teachers generally, find VAM distasteful or insulting, for the record - just spoke for myself. But are you really that sanguine about it? Other than your feelings that it might be helpful, what do you rely on? Individual (and flawed) economists' studies like MET and Chetty et al.? It's a quick drop-off after that - no professional body or organization has endorsed VAM for teacher evaluation. The Economic Policy Institute warned against it. When assessment experts at ETS and the National Research Council won't endorse VAM for teacher evaluation, why would you? AERA, NCME, and APA all say you can't draw valid conclusions for one purpose from a measure that was only validated for other purposes. Are you, or any policy makers, qualified to dismiss that basic tenet of measurement validation, or are your students being assessed using tools that were validated for teacher evaluation use? When an assessment expert like James Popham says teaching, like many other professions, is best evaluated through the expert judgment of fellow practitioners, and points to basic flaws like instructional sensitivity, it doesn't seem like much of an answer to just say, "well, nothing is perfect." Right, observations may be flawed, but we can talk about them on equal footing, repeat the observations, ask other people to observe, etc. We can't in our own evaluations even look at the tests, have an informed discussion about the VAM formulas, or experiment with different measures or different VAM formulas.

You're a science teacher, right? Seems to me like VAM is a lab experiment where you heat up a certain substance using a certain burner, and then predict how quickly it will heat up next time. Except the mass of the substance is going to be different. Or the density, or composition. Or the starting temperature. Or the fuel used in the burner. Or the distance between the burner and the substance being heated. Or there might be another burner rather nearby this time. How many changing and unrecognized variables are you comfortable with? Please don't tell me "other methods are flawed too." Can VAM stand on its merits? At the level of the individual teacher evaluation, I don't see how it can.

@David - What I'm saying is that it's far from established that an evaluation system that incorporates VAM is "the worst option and significantly counterproductive". Once we acknowledge that all evaluation methods have limits, it's simply a matter of determining - rather than assuming - which methods or combinations of methods work best for which purposes.

I'm not sure why you think I'm "endorsing" VAM for evaluation. I'm simply declining to rule it out. (Indeed, in this respect - declining to endorse - I am in many respects hewing closer to the positions of many of those expert organizations than you are.)

It's not just that "nothing is perfect", it's that (1) I see no reason to assume a priori that the limitations of VAMs are worse or more insurmountable than the limitations of other methods and (2) there is evidence that VAMs capture useful information with some reliability and (3) there is some preliminary evidence that evaluation systems incorporating VAMs have at least some positive effects. (I'm not nearly so quick as you to dismiss these studies out of hand, and I am typically suspicious when one "side" in a debate comes to the conclusion that all of the studies inconvenient for their position are "flawed" but the studies convenient for their position are mostly fine.)

If I can't point out that "other methods are flawed, too" (even though you seem to agree with this), presumably it's unfair to just stipulate that the flaws of other methods can be overcome (or are worth the trade-offs) but the flaws of VAMs can't be. "Nothing is perfect" is obviously not enough to justify any particular evaluation method, but nor is "let's assume only the flaws of my preferred method can be worked out or are worth the trade-offs".

It's not obviously the case that VAM can stand on its merits. But there are reasons to think it might be able to, so it's by no means obvious that it can't. You are clearly certain about your position but remember that I am merely arguing for agnosticism, which I think is the most justifiable position on the basis of the facts available.

Paul, I'm fine with agnosticism as a personal stance in a blog discussion, but not as the basis for high-stakes policy decisions. And I'm not merely assuming VAM in evaluations would be a bad idea: I have cited a significant body of expertise, people who know far more about this than I do, and more than any teacher I know. They - not just I - suggest that VAM isn't appropriate for use in evaluations (at this time). (You haven't cited anything yet, by the way).

You wrote: "(2) there is evidence that VAMs capture useful information with some reliability" - How useful is it? To whom? For what purposes? Does it come to teachers and evaluators in a timely manner, in form they can understand, explain, and use to make effective instructional decisions that really meet students's needs (not just their need to score well on tests)? Teachers in NY and TX who've been subject to this kind of system have reported that their VAM changes more according to changes in their school, grade levels or teaching assignments, and they have no idea how to account for significant changes in years that seemed more similar/stable. Would you really want part of your evaluation based on assumptions about typically minor fluctuations in your students' adjusted CST scores? What if I came to your school, taught the same group of students, and turned out to be truly awesome - or awful? How would your VAM scores account for my impact on student literacy (or test taking savvy)?

When I requested that you "don't point out all methods are flawed" I didn't mean to limit your options in the big picture - just to suggest that you've made that point clear and no one disagrees with it. Given all of the failings of VAM and the numerous sources I cited, I'm just trying to understand why you seem reluctant to concede or even acknowledge *how* problematic this is. Can you acknowledge Popham may have a point about instructional sensitivity? What about all the other organizations and studies?

I'm not saying a priori that studies I disagree with are flawed or that I "dismiss these studies out of hand" - but I don't think equivocating about both sides is justified at this point. There are whole professional organizations with expertise in this area lined up on one side of the issue, and a few economists with questionable methods in non-peer reviewed studies on the other side. Isn't that enough to suggest we should hold off - at least for now - on high-stakes use of VAM?

Does the NCME/APA/AERA position hold any water with you, or are you comfortable claiming validity where they all agree it can't be done out of principle? What about the science experiment analogy? Does that challenge your thinking at all? Go ahead and push back at my thinking too, tell me what I'm missing there. I hope that if you do, you can acknowledge parts of it that make sense too.

David, several things:

1) Agnosticism as a personal stance has implications for policy decisions. I would staunchly oppose the sudden, widespread adoption of VAMs as primary methods for evaluating teachers. But agnosticism also implies that there is more to learn and we're probably not going to learn it unless people are willing to experiment with it on a small scale.

2) How useful VAM info is and to whom is precisely what we should be agnostic about. Also, see #1.

3) I've personally cited this ASA report and you've cited some of the major pro-VAM work - i.e., Chetty and MET. Those all cut in favor of my agnosticism, even if critics are a bit overeager to dismiss them. ("Experts differ in their opinions" is a point in favor of agnosticism, not certainty.) If you want additional cites of smaller-scale studies, the literature is full of them (they tend not to get the sort of prominent Ravitch/Strauss treatment less-optimistic studies do), and lists of references in those other, larger studies are a good place to start. I will only note for now that Chetty wasn't even the first to find persistent VAM effects: http://aer.sagepub.com/content/48/2/361.short

4) I remain curious about studies that subject existing evaluation methods to the same sorts of highly-critical litmus tests you are insisting we subject VAMs to. My suspicion is that such studies are rare at best because existing evaluation methods were widely adopted without anything near the level of research scrutiny applied to VAMs.

5) I don't want my evaluations based on anything unpredictable or opaque. And yet, even though I've never been evaluated under VAMs, the content of my evaluations often varies depending on my evaluator, is difficult to interpret, and is often of questionable practical utility. (This is true even though the overall evaluation is always positive!) Once again, all evaluation methods are flawed.

6) I continue to point out that all evaluation methods are flawed because people seem to keep forgetting it by pointing out those flaws without acknowledging that those flaws are typically not unique to VAM. Also, see #4.

7) I'm not "conceding or even acknowledging *how* problematic" VAMs' flaws are because you haven't established how problematic they are. This is in part because it is not fully established in the research literature (see #2) and because you have not established a baseline for comparison (see #4 - the magnitude of a downside always has to be measured in terms of "compared to what?").

8) I'm not sure what you mean about Popham and instructional sensitivity. I know Popham has expressed skepticism that existing *tests* can provide useful information about teacher quality, but I've never been sure why he thinks that, and some of Chetty's work on long-term outcomes (no worries, now through peer review!) suggests he may be wrong as a matter of fact. Or he may not be! (Remember, agnosticism!)

9) To clarify: "agnosticism" means, among other things "not claiming certainty". So I do not "claim validity". I'm agnostic precisely because the (e.g.,) AERA position holds water with me. I just recognize that, in fact, the "other side" consists of more than "a few economists with questionable methods in non-peer reviewed studies".

10) Analogies are probably not useful here, as they tend to pack in a lot of implicit assumptions and obscure more than they illuminate. Better to discuss issues directly.


Disclaimer: The opinions expressed in This Week In Education are strictly those of the author and do not reflect the opinions or endorsement of Scholastic, Inc.