Evaluating Teaching

Every two years, Queen’s evaluates each tenure-track professor’s “merit,” defined by evaluations of their teaching, research, and service work. This affects salary increases – more in early years than in later ones, because of an “abatement” (what some might call a “clawback”) that reduces merit pay for more experienced professors. When I first arrived in 1984, a senior (“full”) professor explained that tenure-track faculty were generally hired at a lower-than-average salary, and merit pay (funded by the difference in salary between retiring professors and junior ones) made up the difference as time progressed. So one naturally has to ask: how is each of these components evaluated?

It turns out that evaluating teaching is problematic. A very significant element at many institutions is student evaluations, and students have biases like anyone else. There is quite a bit of compelling research that shows that female-presenting instructors score lower on student evaluations than male-presenting ones, even when the only indication that an instructor is female is a video where they introduce themselves. There is also evidence (based on fewer studies at this point) that non-white instructors face a similar bias.

There is also the “Dr. Fox effect,” named after an initial experiment in the 1970s about student reactions to “engaging / enthusiastic” lecturers versus “informative” ones; students rate the first kind higher. More recent studies show that students value informativeness less than what some instructors disparage as “entertainment”: personality, charisma, fluency, non-verbal behaviour, and physical appearance.

The fundamental issue is: what is a fair way to evaluate teaching effectiveness? Some have suggested that having experienced teachers sit in on lectures would be better, but the excessively heavy workloads at universities make it difficult for people to make time to do so; an hour at a time, several times during the term, for all instructors being evaluated, might at first seem like a small commitment, but it is not. You’d need to observe 2-3 lectures per instructor in case one just happened to be a bad day. In my School at Queen’s you’d need to do this for each of about 30 professors, depending on how many have teaching reductions or are on sabbatical, which adds up to as much as 90 hours over a term. This is more than two full weeks of work at a pace that is humane (which, generally, academic workloads are not). Such observation does sometimes happen when an instructor is up for tenure or promotion, or is looking for recommendations for a new job (if, for example, they were denied tenure and haven’t given up on academia yet).
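For anyone who wants to check the arithmetic above or plug in their own department's numbers, here is a minimal sketch. The function name is made up for illustration, and the 40-hour week is my assumption of what a “humane” pace means; the essay itself only gives the counts of professors and lectures.

```python
def observation_hours(professors, lectures_each, hours_per_lecture=1.0):
    """Total hours needed to sit in on lectures for every instructor."""
    return professors * lectures_each * hours_per_lecture

# Figures from the essay: about 30 professors, 2-3 one-hour lectures each.
low = observation_hours(30, 2)
high = observation_hours(30, 3)

# Assumption (not from the essay): a humane work week is 40 hours.
HUMANE_WEEK_HOURS = 40
weeks_at_high = high / HUMANE_WEEK_HOURS

print(f"Between {low:.0f} and {high:.0f} hours per term")
print(f"Upper end is about {weeks_at_high:.2f} forty-hour weeks")
```

With fewer professors (say, several on sabbatical) or only two visits each, the total drops, but it remains well over a full week of someone's time.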

However, that’s a means of evaluating instruction, based on conduct of lectures, not a criterion, a measure of effectiveness. There are so many more issues, hard to measure, that are equally if not more important. How clearly are the course objectives laid out? How well do assessment tasks measure student mastery of the material? How well do lectures and other resources prepare students for the assessment tasks? And more fundamentally, how much and how well do students learn?

The last, which might seem best for measuring teaching effectiveness, is not as simple as looking at grades. You at least need to measure what students were capable of before the class started, which is very difficult in any field where (a) the course introduces fundamentally new material in which many students would have no background, while (b) a nontrivial number of students have extensive self-taught experience. This turns out to be fairly common in some subsets of Computing, especially those with a significant practical component such as programming or other technology.

Besides, there are at least four different philosophies of what grades are for (which might be the subject of some future essay), only one of which is “how much students improved.” Grades are also strongly affected by student attitudes; engineering students in particular are so overworked that they are often forced to compromise on how much effort they can put into some courses. All that matters is passing (“five-O and go”), because becoming an engineer requires getting a degree and getting the “iron ring” at the end. Grades aren’t irrelevant, but they are less important.

Department heads do know about these problems, so, as a result, there’s a tendency for teachers to get 10/10 on teaching merit – aside from award-winners who sometimes (but not always) get 12/10 in the year they get the award, and teachers students complain about vocally, who might get 7/10.

I do not have a solution, nor, I think, do most universities.

Here are the sources I consulted while writing this essay:

  • Dr. Fox effect, Wikipedia article, accessed 2022-04-03. A more complete explanation of the effect, although with fewer citations than Wikipedia editors are happy with.
  • Exploring Bias in Student Evaluations: Gender, Race, and Ethnicity, K. Chavez and K.M.W. Mitchell. Cambridge University Press. A research paper with a small experiment, whose value to me was in the review of previous work in the field.