New Mexico's Teacher Evaluation System Needs Updating

by Greg Butz, a 12-year veteran educator who began teaching internationally in Egypt and Hong Kong. He now teaches at a charter school in Albuquerque. He is a Teacher Liaison, served on the NM Social Studies Dream Team, and is currently working toward an MBA in Educational Leadership at UNM. In the classroom, Greg believes in authentic connections, with an emphasis on student-driven, project-based inquiry.


Greg with New Mexico Governor Michelle Lujan Grisham


Not All Student Achievement Data Is the Same

Student assessment was originally designed to measure student progress, not to feed teacher effectiveness ratings. Conversations continue across the country about which dimensions should be used in teacher evaluations. Recently, New York took steps to separate state data from teacher effectiveness scores, leaving it up to districts and collective bargaining units to determine whether it should be used.

I personally don’t find it unreasonable to evaluate teachers using required federal and state standards, but how are we collecting this data when evaluating teachers?

I empathize with teachers who feel at the mercy of student test performance, which doesn't always correlate to long-term student success. I also empathize with teachers who remain dedicated in schools with high student absenteeism stemming from myriad causes. Teachers working to close the opportunity gap are my heroes!

And I listen to teachers across New Mexico make excellent arguments. One stated, “I would be in favor of teacher accountability if student accountability was also tied to it.” Not to be outdone, another remarked, “What about parent accountability?!” It takes a village, and indeed more should be done to engage the entire community.

In New Mexico, PARCC scores affected math and English teachers but neglected other subject areas. Science teachers had student data from grades 4, 7, and 11 on their reports from the SBA exam. Some teachers were assessed by multiple student assessments: one said five different tests went into their report; I had only one. Some teachers complain that end-of-course exams (EOCs) aren’t rigorous enough; others complain about the randomness of the content.

In short, we need to address the inequity of student test data in teacher evaluations.


Despite the issues that come with annual assessment, New Mexico's PARCC exam set a benchmark to measure student understanding nationally, as mandated by the Every Student Succeeds Act (ESSA). Yet, my evaluation is not linked to PARCC scores.

Instead of having nationally recognized student achievement growth data, my results come only from an EOC based mostly upon memorized content. As a social studies teacher, I am expected to incorporate English Language Arts (ELA) reading and writing standards, but these aren’t foci of the EOCs. Increasingly, teachers are using project-based inquiry, Socratic seminars, learning by design, and 21st-century skills in their classrooms.

Our tests should represent this higher order thinking, not rote memorization.  


Items on Social Studies EOCs

Let's dig further. On the Ancient Civilizations (6th grade) EOC, students must demonstrate understanding spanning 5,000 years – from Ancient Mesopotamia to Medieval Europe. One question in the released 2017-2018 Social Studies Blueprint materials asked sixth graders:

“How did the Black Plague affect Medieval Europe?”
a. It weakened the feudal system
b. It ended the crusades
c. It improved the health care system
d. It increased the general population

This question is loaded with content that must be memorized: “feudal system,” “crusades,” “Black Plague,” and “medieval.” Have we considered how an English Language Development (ELD) student might attempt this question?

And here’s the clincher: according to the Oxford University Press, the Black Plague was the impetus for improvements in the health care system, even though that’s the wrong answer on this form! For 8th grade, according to EOC released items, one question asked, “Which term is used to define westward migration and expansion of people during the 1800s within the United States?” Answer: Manifest Destiny.

Beyond the obvious cultural insensitivities that come with a one-sided definition of “Manifest Destiny,” my students shared frustrations that instead of seeing this question, they saw a painting that they needed to recall.

A series of grade-appropriate, thought-provoking questions might have been asked instead:

  • “What were the technological advances that enabled American western expansionism?”

  • “How did expansionism harm the environment and communities of Native Americans?”

  • “Why do different peoples have different reactions to the term ‘manifest destiny’?”

These higher-order Bloom’s taxonomy questions probe multiple perspectives instead of a one-sided ethnocentric definition. Questions like these require extended responses.

So is it fair (or accurate) that student success on 30-40 random, multiple-choice recall questions represents 40 percent of a teacher’s evaluation in non-PARCC subjects?


Proposed Changes to NMTEACH

The existing teaching and planning domains in NMTEACH are understandable with proper coaching. Below are several proposals that the new administration should address to immediately benefit the system:

Proposal 1: Exemplary Should Count As Bonus Points. The exemplary category for teachers should be adapted to give bonus points on teacher evaluations for implementing and performing at a school or district-wide level. Macro-approaches are the job of leadership and administration, not a classroom teacher. Furthermore, no independent observer can see those school-wide approaches in an hour lesson anyway. The baseline for excellence should be the “highly effective” category, unless we are looking for ways to penalize teachers not operating within the scope of instructional coaches or administrators.

Proposal 2: Evolve the Student Growth System. The mechanics of calculating student growth need to evolve. Each year, teachers are assigned different groups of students, yet teachers are still graded on the performance of a group from two years prior, despite the teacher no longer instructing them. Assume that in three consecutive years a teacher’s student growth scores are -0.59, 1.00, and 1.30. Scores are improving each year, yet because old cohorts remain in the calculation, the rating would still reflect reduced growth. Immense assumptions are baked into this.

If each year measures completely different groups of students, it’s not tracking student growth at all – instead it tracks test performance, from different groupings of students. To properly measure student growth, we could track the progress made by the same group of students. One method is to have standards-aligned tests administered at the beginning of the year, with a similar test administered at the end of the year to track specific student growth intra-year. Perhaps more importantly, only the current year data should matter on each year’s evaluation.
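To make the arithmetic concrete, here is a minimal sketch of the difference between the two approaches. The scores are the hypothetical figures from the example above, and the rolling three-year average is my assumption about how multi-year ratings blend cohorts, not a documented NMTEACH formula:

```python
from statistics import mean

# Hypothetical growth scores from the example above; the -0.59 comes
# from a cohort the teacher taught two years ago.
yearly_scores = [-0.59, 1.00, 1.30]

# A rolling multi-year average (assumed here) keeps old cohorts in play,
# dragging the rating well below the teacher's most recent result.
multi_year_rating = mean(yearly_scores)

# Using only the current year's pre/post data isolates this year's students.
current_year_rating = yearly_scores[-1]

print(round(multi_year_rating, 2))  # 0.57
print(current_year_rating)          # 1.3
```

Even though every successive cohort outperformed the last, the blended rating (0.57) tells a much gloomier story than the current year alone (1.30).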

Proposal 3: Get out of a Fixed Mindset. Instead of labeling teachers on an “effectiveness” scale, the data should be used to target teachers who need stronger coaching. When I was a teacher liaison, I benefited from excellent coaching, even receiving a copy of Carol Dweck’s Mindset from a helpful NMPED staffer. If we want teachers to evaluate student progress based upon growth-mindset principles, shouldn’t we be doing the same in our evaluation of teachers?

Proposal 4: Reward Student Success. Teachers feel punished when student scores don’t show growth. What have we done to systematically ensure this data represents the best of student capability? Have we ensured that students are responsible for their scores? We could accomplish both by drastically rethinking how to maximize student buy-in, linking extrinsic rewards to their success.

Proposal 5: Use Better EOCs. In light of the issues highlighted with EOCs, at minimum we need to rethink the essential skills that students need, shifting away from memorization of content toward enduring knowledge and the skills required in the 21st century. I suspect EOCs were an attempt to bring equity to non-PARCC-tested classes, but this created a reliance on weaker tests. Retooling EOCs should be paired with better student data like PARCC, or its ESSA-approved replacement. (Or simply eliminate EOCs altogether.)

Additionally, many courses outside of English and math share a responsibility to support the development of those core subjects. But how do we measure this? And should we measure this? The recent elimination of PARCC via executive order could usher in a reliance on poorer-quality assessments with a lower bar for student success. We must ensure that the assessments used to measure student growth represent the best possible data.

Proposal 6: Reward Extra Days Present. Teachers could receive a financial incentive for being in their classrooms each day under the ten allocated absences. Instead of monies going to substitute teachers (already built into budgets), teachers should receive those funds as bonuses if they do not use their days. And the best news is that this would come at zero extra cost to districts while incentivizing even more quality hours in the classroom.
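A quick sketch shows how the arithmetic could work. The ten-day allotment comes from the proposal above; the $100 daily substitute rate is a hypothetical figure for illustration, not an actual district budget line:

```python
# Hypothetical substitute daily rate; actual rates vary by district.
SUB_DAILY_RATE = 100
ALLOCATED_ABSENCES = 10  # the ten allocated absences from the proposal


def attendance_bonus(days_absent: int) -> int:
    """Redirect unspent substitute pay to the teacher for each unused absence."""
    unused_days = max(ALLOCATED_ABSENCES - days_absent, 0)
    return unused_days * SUB_DAILY_RATE


# A teacher absent only 2 days banks the other 8 days' substitute pay.
print(attendance_bonus(2))   # 800
# A teacher who uses all ten days receives nothing extra, costing nothing.
print(attendance_bonus(10))  # 0
```

Because the payout is capped by money already budgeted for substitutes, the district's total outlay never increases.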

Proposal 7: Explain and Address “Ineffectiveness”. NMTEACH has been in development since 2011. The same mechanisms used to declare teachers “not effective” starting in 2014 now show 1,000 more effective teachers in 2018, according to the National Council on Teacher Quality. And yet, according to the same organization, 28.7% of all New Mexico teachers are below effective -- a statistic more than double that of the next closest state, Oregon at 11.7%. How is this possible?

It’s because PED never transparently addressed lingering issues with the system. I know of several teachers whose evaluative scores swung from “highly effective” to “minimally effective” a year later with little explanation. Furthermore, NMTEACH aggressively measures teachers against an unrealistic Exemplary domain, which should never have been used to evaluate a classroom teacher. This skewed our evaluative data.

A Path Forward

We must ensure that the data we use to assess teachers are equitable, fair, and properly promote success for all teachers. The rallying cry of today’s educator advocate centers on “accountability.” But was NMTEACH transparent and equitable enough to be held accountable? I would argue no, and for these reasons it needs to be reformed.