Where Evaluation Policy Stands

This week, we’ve heard from school leaders on the challenges (and rewards) of implementing teacher evaluation at the school level. We’ve heard from one of the country’s best teachers on her experiences before and after meaningful evaluation. Today, we’ll hear from three of our favorite thought leaders on the policy side of the conversation, to find out if they think evaluation reform is still worth the effort it requires—and where they think we should all be focused as we move forward.


Our contributors:

Joanne Weiss, Education Consultant

Andy Smarick, Partner, Bellwether Education

Thomas J. Kane, Professor of Education and Economics at Harvard Graduate School of Education


Joanne Weiss, Education Consultant and former Chief of Staff to U.S. Secretary of Education Arne Duncan and director of the federal Race to the Top program.

Joanne played a critical role in the Obama administration’s efforts to modernize teacher evaluation and development, first as Arne Duncan’s director of Race to the Top and later as his chief of staff. When we look back at how the recommendations in The Widget Effect gained mainstream attention, one key factor was the inclusion of evaluation reform as a priority in the first round of Race to the Top. Joanne was in the thick of it. States responded to the grant competition by passing dozens of laws. We asked Joanne to reflect on why the administration did it and what she learned.

The foreword to The Widget Effect called upon the education profession to address a pervasive “culture of indifference about the quality of instruction in each classroom.” That simple indictment, on page one of the report, stopped me cold in 2009. The report painted a picture of an education system skewed toward mediocrity, where excellence was not recognized or rewarded, the struggle to improve was not supported, and incompetence was ignored. Decades of research had told us that the quality of teaching mattered more than any other in-school factor for driving students’ outcomes, yet we had created a culture indifferent to instructional quality.

The Widget Effect gave us a vocabulary to use in talking about this problem, provided concrete evidence of its existence, and offered a simple policy recommendation: make performance evaluation a real lever for differentiating the quality of teaching. So five years later, where are we? 

The good news is that most people now describe an effective teacher as one whose students learn. That simple connection represents a seismic shift in thinking.

But while the call to action of The Widget Effect changed the conversation, we are far from achieving the vision. Districts and schools still struggle to attract, develop, and retain talented teachers. In many places, performance evaluation has become an exercise in compliance rather than a driver of instructional improvement. Most districts treat their first grade teachers the same as their physics teachers, and their accomplished teachers just like their novices. Strengthening the teaching force requires not just differentiating performance, but also responding to it with differentiated support and opportunities.

As we reflect on the past five years and look ahead to the next five, The Widget Effect’s lessons remain deeply relevant—even more so in light of the instructional demands that the Common Core is placing on educators. Performance evaluation can still be a launching pad for powerful professional learning and catalyze career opportunities for accomplished teachers—but this won’t “just happen” on its own. The evaluation conversation is finally moving past weightings and ratings to more impactful questions around instructional quality and improvement. Maybe…just maybe…at this nexus of new standards implementation and performance evaluation, educators will find a way to create a culture that develops, recognizes, and prioritizes instructional excellence.


Andy Smarick, Partner, Bellwether Education.

Perhaps no one is better qualified to shed light on how states implemented teacher evaluation reform than Andy Smarick. He had a front row seat as a state official in New Jersey in the Race to the Top years. Andy is also well known for advocating a judicious view of policy change, reminding us that some things are worth preserving. We asked him whether it was the right strategy, in hindsight, to push for statewide evaluation reform rather than leaving evaluation to local districts.

On balance, I remain a supporter of statewide teacher evaluation reform. But the scale has weight on the “con” side, too. I think we need to keep those issues in mind as opportunities for course corrections become available.

I’m still in favor for three big reasons. First, The Widget Effect demonstrated that old systems weren’t differentiating teachers by performance. This ran counter to the research showing huge variation in educator effectiveness, and it ended up adversely affecting low-income kids disproportionately. A range of extant policies and practices created incentives that made it highly unlikely things would change sufficiently absent a major policy initiative.

Second, we must remember that state governments create rules related to educator preparation, tenure, seniority, certification, and, in some cases, compensation and career ladders as well. These have long been understood as consistent with the state’s role in K-12. It seems reasonable, if not necessary, for state leaders to ensure that such policies are undergirded by a system that transparently and accurately measures each educator’s influence on student learning.

Third, state governments are ultimately responsible for ensuring students have access to high-quality schools. Since teacher effectiveness is the most important in-school factor related to student achievement, it makes sense for states to create policies that assess teacher effectiveness and make such determinations part of key personnel decisions.

But I do have four concerns. The first is that teacher evaluation reform is a massive “input.” It tells districts how to do their business, instead of holding them accountable for results. This stems from the fundamental mistake state governments made generations ago: delegating, in perpetuity, the state obligation over K-12 to state-created entities called districts. We’ve known for decades now that many districts are not getting the results we want, but rather than reconsidering states’ wholesale delegation of K-12 responsibility to districts, we’ve tinkered around the edges. We’ve created policies related to standards, assessments, accountability, highly qualified teachers, school interventions, professional development requirements, salary scales, and on and on. When you’re unwilling to fundamentally reconsider the role of districts, you inevitably end up toying with inputs ad nauseam. Teacher evaluation reform is the logical extension of our unfortunate deference to the district structure.

Second, most of the public was convinced that districts and schools in inner cities and other low-income communities needed to be improved. So substantial reforms targeted to these areas generally generated political support. But much of the public was not convinced that we had statewide K-12 problems. Many families believed—and some with very good reason—that their schools were doing well or better. Statewide evaluation reform not only forced significant change in all schools, it implied that there was a problem with some teachers in every school. Before launching such a broad-front policy offensive, the reform community should’ve done a whole lot more explaining and listening.

Third, the entire enterprise elevates technocracy above history and experience. Teaching is complex, and teachers are part of a community’s intricately woven fabric of social capital. Although these new systems wisely called for the use of multiple measures, the impression they gave was that what we had in place before was entirely wrong and that the new system would reduce this profession to a series of numbers in a spreadsheet. I think that if we had had a more thorough discussion of all of the things teachers contribute to a school and community, we would’ve spent more time overhauling observation rubrics, creating parent surveys, honing student surveys, listening to local citizens, and more.

Fourth, the policy got ahead of the practice on the use of measures of student performance. I’m firm in my support for including student academic achievement in educator evaluations. But bringing that to life is extraordinarily difficult. It’s not just about improving value-added models and student growth percentiles; it includes creating reliable student learning objectives, school and district-level assessments, and much more.

Lastly, in hindsight, the federal government probably should’ve shown humility and stayed away from this. Yes, states had the right to ignore Race to the Top and ESEA waivers (and thereby stay away from evaluation reform), but the effects of the recession and the pressures of NCLB all but forced states’ hands. I’m not averse to federal pressure in the right areas, and I remain supportive of federal competitive grant programs. But this administration pushed too far and too fast in too many areas at once. Evaluation reform would’ve been challenging in the best of circumstances, but when it was instigated by an innately clumsy Uncle Sam and tied up with changes in standards, assessments, failing-school and data-system reform, it became too much for many to bear.


Thomas J. Kane, Walter H. Gale Professor of Education and Economics at Harvard Graduate School of Education.

Tom is one of the most important education researchers of the past few decades. He and his colleagues laid significant groundwork for The Widget Effect by showing that teachers are not all the same, even if policy tends to treat them that way. In the years following the publication of our report, Tom revolutionized research once again by helping to lead the Gates Foundation’s Measures of Effective Teaching study. We asked him to paint a picture of how the evidence base has evolved in the five years since The Widget Effect and what implications he sees.

Five years after the publication of The Widget Effect, school districts around the country are working to reinvent their teacher evaluation systems. However, given the history of perfunctory evaluations, schools are still struggling to differentiate and agree upon standards. As more and more teachers are lumped into one category, there is a real danger that “proficient” will become the new “satisfactory.”

One reason is that schools have historically seen teacher evaluation merely as a way to weed out malpractice, to identify the worst of the worst. One implication of that cultural history is that teachers and principals end up setting a very low bar for acceptable performance. “Unsatisfactory” becomes a synonym for “child abuse.” It is also the primary reason why teachers unions hear “assault on teachers” whenever they hear the term “teacher evaluation.”

However, given everything we’ve learned about the importance of effective teaching, the threshold for acceptable practice should be much higher than simply “not malpractice.” Culture change could start by reminding principals that every time they decide to tenure a teacher, they are implicitly deciding to forgo 20 or more years of draft picks for the same position. That framing, which is, after all, an accurate description of what they’re actually doing, also implies a much higher standard of acceptable practice. At the least, a teacher would have to be more effective than the average novice teacher in order to warrant 20 years of not being able to hire the average novice teacher. By that standard, far more than 2 or 3 or 5 percent of teachers would be failing to meet the tenure threshold. Although it would depend upon the heterogeneity in teacher effectiveness, the average quality of the recruiting pool, and the average rate of improvement after their initial year of teaching, the number would be much closer to 35 percent in most districts.

The problem won’t be solved by picking a different classroom observation rubric and providing better training to principals. As Engels might say, we must dispel the “false consciousness” which has grown up around tenure and reveal the reality underneath. How might we change the tenure process to remind principals of the choice they’re actually making? One idea would be to require principals to interview at least three new prospective candidates for every tenure slot. That way, the alternative candidates for a given position would not be hypothetical; they would have names and faces. Another idea would be to create some additional paperwork hurdles every time they seek to tenure a teacher with measured effectiveness lower than the average novice.  

However we do it, we will not be able to raise standards in teacher evaluation as long as its primary function is seen as a means for preventing malpractice. Malpractice is far too low a standard.  

Imali Ariyarathne, seventh-grade teacher at Langston Hughes Academy, introduces her students to the captivating world of science.
