[Update: This post mostly deals with how instructors can best use the problematic instrument of student evaluations. A recent, accessible post for administrators who are determining how (or if) to use student evaluations can be found in Research on Student Ratings Continues to Evolve. We Should, Too. We both cover many of the same resources.]
Having just finished another summer of teaching, last week I participated in the seemingly timeless ritual of passing out and administering student course “evals.”
Over four decades of research has shown, however, that student evaluations of teaching (SETs) are poor and often problematic measures of teaching effectiveness. Not only do SETs exhibit systematic bias, the data are also often statistically misused by administrators and instructors themselves. These problems led one recent study to flatly proclaim that “SET should not be relied upon as a measure of teaching effectiveness,” and that “SET should not be used for personnel decisions.”
When we accept that SETs are neither statistically reliable (consistent across measures) nor statistically valid (testing what they claim to test), nor appropriately applied, we are left to question whether student evaluations offer any viable information for the instructor.
I suggest that student feedback, especially written comments provided with proper instructions, can open crucial lines of communication with the class and can be used in a limited capacity – in coordination with other measures – for instructors to critically self-assess. I offer end-user advice for university instructors on how best to prepare students to give constructive feedback and how to utilize that information to become more critically reflective and effective teachers. I draw upon scholarly research and recently implemented institutional initiatives, illustrated with personal practices, to comment on how best to incorporate student feedback into our teaching.
(1) A Broken Instrument – Overview
I take as axiomatic that SETs as currently designed and interpreted are poor proxies for measuring teaching quality or effectiveness. For example, a range of studies have shown that among the best ways to predict student evaluations is to examine the responding students’ interim grades. In other words, students’ anticipation of a good course grade is highly correlated with positive teaching evaluations. This conflation between grade expectation and teaching effectiveness is just one of the reasons the validity of SETs has been called into question. Clearly, instructors can also engineer positive feedback by “teaching to the test” (instead of having students do the more difficult work of learning the skills required to do well on a test) or by adopting generous grading policies, among other tactics.
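To make the grade–evaluation correlation concrete, here is a minimal sketch, using invented numbers (not data from any study cited here), of how one might quantify the association between students’ expected grades and their SET ratings with a Pearson correlation coefficient:

```python
# Illustrative sketch only: synthetic responses showing how the correlation
# between expected grades and SET ratings might be computed.
# All numbers below are invented for demonstration, not drawn from any study.

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical responses: expected grade on a 4-point scale,
# SET rating on a 5-point scale.
expected_grade = [4.0, 3.7, 3.3, 3.0, 2.7, 2.3, 2.0, 3.7, 4.0, 3.0]
set_rating     = [5,   5,   4,   4,   3,   2,   2,   4,   5,   3]

r = pearson_r(expected_grade, set_rating)
print(f"r = {r:.2f}")  # strongly positive in this invented sample
```

Real analyses, of course, control for confounds such as instructor, course, and student population; a raw correlation like this is only a starting point.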
One of the more prominent areas of research has focused on the gender and racial biases exhibited in SET data. Dozens of empirical research papers point to statistically significant biases against female instructors, and one recent randomized, controlled, blind experiment at a US institution bolstered these findings. Regardless of student performance, instructors perceived as male (the experiment was performed with an online course) received significantly higher SET scores. This held true even for measures one would have expected to be objective, such as perceived promptness in returning graded assignments – even though both male and female instructors returned assignments at the same time. In aggregate, studies have shown that women and men are equally likely to make biased judgments favoring male instructors.
Additionally, there is evidence suggesting that students who rate their instructor’s effectiveness highly and who take subsequent advanced classes will perform more poorly (i.e. receive worse grades) than students who rated their previous instructor’s effectiveness as low. This means that more effective instructors can actually be evaluated more negatively than their less effective teaching counterparts. Part of this poor evaluation may be due to instructors using more challenging active-learning or deep-learning strategies, which have been shown to be more effective teaching techniques but sometimes elicit active student resistance.
Despite their ubiquity on college campuses, it has been shown that SETs do not primarily measure teaching effectiveness, but instead measure student biases and overall subjective enjoyment.
I will not attempt to convince skeptics of the reliability of this striking research; there is plenty of scholarship available to comb through to form your own opinion. One can view the online bibliography of gender and racial bias in SETs, regularly updated by Rebecca Kreitzer, here. Additionally, there are at least two peer-reviewed journals dedicated to exploring evaluation more broadly, Assessment & Evaluation in Higher Education (1975-) and Studies in Educational Evaluation (1975-). For a summary overview of SET biases, meta-analyses are offered in Wright & Jenkins-Guarnieri 2012 (which concludes that SETs are apparently valid and largely free from bias when paired with instructor consultations) and Uttl et al. 2017 (which concluded SETs are invalid).
For the TLDR crowd, I would simply suggest reading Boring et al. 2016, a work of high statistical rigor examining two separate randomized experiments. It also received a fair amount of popular press. There is also a presentation on some of its principal findings by one of the contributing authors available online.
I also make no attempt to argue that one can read course evaluations in a manner that adjusts for student bias – the factors contributing to that bias are so numerous and complex that SETs should not be treated as sole objective measures of teaching quality under any interpretive lens. Recommendations on how best to use SETs in hiring, firing, or tenure decisions have also been discussed in the academic literature. A qualified (and sometimes apologetic) defense of SETs is put forth by Linse 2017, while a point-counterpoint perspective is provided in Rowan et al. 2017. In general, incorporating SETs as part of a much more comprehensive teaching portfolio appears to be the middle ground adopted by many university administrations. (The American Sociological Association also published its suggested guidelines for using student feedback, in light of recent research, this week.)
(2) Finding the Critical Perspective – Brookfield’s Four Lenses
From the perspective of an instructor, we must remember that student feedback constitutes only one window onto our teaching. Stephen Brookfield has developed a method to help instructors become more critically reflective teachers by using four lenses, often simply referred to as Brookfield’s Four Lenses. In the hopes of increasing self-awareness, one must draw from several different vantage points to gain a more comprehensive perspective. These “lenses” include 1) the autobiographical lens, 2) the students’ eyes, 3) colleagues’ perspectives, and 4) theoretical literature. These roughly correlate to self-reflection, student feedback (or SET), peer evaluation, and exploration of scholarly research.
Among these Four Lenses, arguably the most important is self-reflection, which ultimately encompasses the other three since they all require comparative reflection. This heightened self-awareness forms a foundation for critical and reflective teaching and informs us where adjustments in our teaching may be necessary.
Lens 1A – Annotated Lesson Plans: In terms of the autobiographical lens, on a practical level, I regularly take notes after individual lectures (sometimes, simply, in the time between when one class ends and the next begins), recording things I found pertinent to the effectiveness of conveying the material, such as how long class activities took, good questions asked by students, insightful discussion topics, and sticking points or conceptual hurdles. Undoubtedly, these notes have become the most valuable information I consult when revisiting lectures in later semesters. Specifically, these lecture annotations allow me to adjust future material, activities, discussions, and timing allowances.
Lens 1B – Annotated Syllabus & Journaling: Another helpful self-assessment activity has been annotating my syllabus throughout the semester, culminating in a significant review at the end of the term. By regularly taking notes on readings, class policies, grading procedures, and course organization, this information has assisted me in reconceptualizing my courses and tracing out new areas to explore. Lastly, I have implemented journaling – primarily in the form of this blog – as a means to reflect upon my experiences in the classroom (both positive and negative) and chronicle my discoveries about teaching.
Lens 2 – Mid-Term Evaluations: From this bird’s eye view, student feedback operates as just one measure of teaching quality and should be balanced against other critical perspectives. Importantly, gathering student feedback should not be reserved for only the end of a course. Informal, anonymous mid-term evaluations can provide actionable ideas that could help correct teaching oversights – or encourage us to continue what we are doing.
Typically, I will ask a pair of subjective questions: 1) “What is working well for you?” and 2) “What is not working well for you?” – both in relation to my teaching of the course material. I will also direct students to think about numerous facets of the course, including the readings, assignments, class activities (group activities or student-led discussions), and lectures – or anything else – for comment. Admittedly, not all of the anonymous feedback is constructive or actionable, but if I see clear patterns in comments I will take them into consideration when planning future classes. I also spend a few minutes at the beginning of the following lecture discussing the feedback with the class and allowing for further discussion. I use this as an opportunity to note which comments were actionable (positive or negative) and which were irrelevant (such as the time of the class, the size of the class, or the temperature of the room). Students need training in providing relevant and actionable narrative commentary, a point I will return to below.
Lens 3 – Teaching Community: As is commonplace in graduate school, I received no formal training in teaching as part of my program, but it was through collegial conversations with peers that my interest, and confidence, in teaching grew. Even if my colleagues did not possess formal training in pedagogy, this informal community functioned as a place to discuss classroom successes and failures and still provided another valuable perspective. In many cases, these conversations revealed the diversity of possible approaches in the classroom and inspired me to take a few pedagogical risks (or what I originally perceived to be risks).
Lens 4 – Scholarship on Teaching and Learning: In order to best make sense of the insights drawn from the three lenses of the self, student, and peer, instructors should also consult literature or engage with established theory. This oftentimes provides us with technical vocabulary that can better describe the experiences we all share. Fortunately, most universities offer workshops that instructors can attend to improve the quality and effectiveness of their teaching. Moreover, the Scholarship on Teaching and Learning (SoTL) is quite voluminous, including many journals such as College Teaching, International Journal of Teaching and Learning in Higher Education, Journal of Effective Teaching in Higher Education, and the Journal on Excellence in College Teaching, among others. There are also numerous disciplinary journals dedicated to teaching, including Teaching Theology and Religion and the Journal of Religious Education in my home discipline of religious studies.
(3) Revisiting Student (Written) Feedback – And Hope Remained?
There is significantly more research on the close-ended ordinal-scale questions of SETs than on the open-ended “narrative” commentary that often accompanies them. Several studies have noted that written comments can provide more useful and important feedback than statistical reports. Of course, this does not mean that all comments are necessarily relevant to teaching effectiveness, nor should they be assumed to be free of bias. While much more research needs to be done in this area, written comments can contain more course- (and instructor-) specific details and provide actual ideas to improve teaching. Because of the potentially actionable and specific nature of written comments, instructors should strategize on how best to administer the written portion of student evaluations.
It is important for instructors to make sure students are aware of the purpose of student feedback and possibly explain how feedback has been used in the past to create better learning environments. In order to help students reflect on the effectiveness of my teaching, I will often revisit the course syllabus and have the students reread the learning outcomes, directing them to think further about how the structure of the course, readings, assignments, and activities helped or inhibited the realization of those outcomes. Focusing student attention on teaching effectiveness and quality can help minimize irrelevant commentary or comments on (perceived) instructor identity.
It is also important to inform students about the value of written comments and invite them to write down their insights. Research shows between 10% and 70% of SETs include written comments, so asking students directly to write commentary is necessary. To ensure the comments are actionable, I also ask students to provide the rationale for their opinions (simply, I tell them to always use “because statements,” e.g. “I (dis)liked this course because…”). Importantly, I also give students ample time to discuss and complete the evaluation task, around 10-15 minutes (I leave the room when students begin the evaluations).
Some institutions are launching initiatives that explain the importance of student feedback to students directly and describe how to provide effective feedback. For example, the Faculty Instructional Technology Center at UC Santa Cruz provides instructions to students about crafting effective comments (with examples) and what types of comments (emotionally charged and identity-based) to avoid (see here). Moreover, the center provides instructions for instructors on how to craft the most beneficial questions, focusing on specificity, open-endedness, and goal-orientation (see here). (Similar instructions can be found at the UC Berkeley Center for Teaching and Learning, the University of Wisconsin-Madison, and the Vanderbilt University Center for Teaching.)
A more innovative approach was recently taken by the UC Merced Center for Engaged Teaching and Learning, which produced a set of short 3-7 minute videos for instructors to show in their classes (instructors choose which length to show; the 3-minute version is embedded above). Promoted as “Students Helping Students,” the videos feature university students talking about the importance and purpose of feedback and provide guidelines on crafting useful comments (see here).
After receiving written student feedback, instructors should pay attention to recurring themes or stories that emerge in the commentary. Non-corroborated comments mean very little, especially if they do not align with your own reflections, the observations of colleagues, or insights taken from scholarly literature. In the end, mid-term and end-of-term student feedback, especially written commentary, can offer crucial insights that allow instructors to critically self-assess pedagogical strategies and develop into reflective teachers.
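To illustrate what attending to recurring themes might look like in practice, the following toy script (my own sketch, not an institutional tool; the theme keywords and example comments are all invented) tallies how many comments touch on each hand-picked theme:

```python
# A minimal sketch of surfacing recurring themes in written comments by
# tallying hand-picked keywords per theme. Themes, keywords, and comments
# are invented for illustration; adapt them to your own course vocabulary.
from collections import Counter

THEMES = {
    "pacing":      ["fast", "slow", "rushed", "pace"],
    "readings":    ["reading", "textbook"],
    "assignments": ["assignment", "homework", "essay"],
    "discussion":  ["discussion", "group", "activity"],
}

def tally_themes(comments):
    """Count how many comments mention each theme at least once."""
    counts = Counter()
    for comment in comments:
        text = comment.lower()
        for theme, keywords in THEMES.items():
            if any(kw in text for kw in keywords):
                counts[theme] += 1  # count each comment once per theme
    return counts

# Invented example comments.
comments = [
    "The pace felt rushed in the last two weeks.",
    "I liked the group discussions because everyone participated.",
    "The readings were long, but the discussion questions helped.",
]

for theme, n in tally_themes(comments).most_common():
    print(theme, n)  # prints: discussion 2, pacing 1, readings 1
```

Keyword matching is crude (it misses synonyms and context), so a tally like this supplements, rather than replaces, a careful read of the comments themselves.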
 Given the subjective nature of student evaluations (described below), some institutions and researchers read the acronym SET as “student experiences of teaching.”
 Boring et al. 2016: 2, 11
 “Teaching effectiveness” is generally, though not universally, defined as the instructor’s capacity to facilitate or increase student learning.
 Economist Richard Vedder comments that grade inflation in American universities began in the late 1960s and early 1970s, roughly when student evaluations became a common evaluation tool (SETs were first used in the 1920s; see Linse 2017). The classic study on this phenomenon appears to be Johnson’s Grade Inflation: A Crisis in College Education (2003). Irrespective of the title, a large portion of the book is dedicated to analyzing SETs and their relation to course grades. More recent studies include Griffin 2004 and Stroebe 2016. To be fair, some debate the magnitude of the correlation between SET and grade expectation, see e.g. Gump 2007 and Linse 2017. One can refer to the meta-analyses presented in Wright & Jenkins-Guarnieri 2012 and Uttl et al. 2017 (the latter summarized here [https://www.insidehighered.com/news/2016/09/21/new-study-could-be-another-nail-coffin-validity-student-evaluations-teaching]).
 This could also include timely psychological priming, such as telling students they are doing exceptionally well with extraordinarily difficult materials, or giving easy assignments early in the term to set up higher than normal grade expectations.
 MacNell et al. 2015. The data were further analyzed in Boring et al. 2016. Most of the empirical research in this area incorporates incomplete censuses of the student population (the students who simply return their evaluations) as opposed to truly random samples of the population; this is thus an important study confirming the findings of other reports.
 There are numerous other biases that have been detected in SET data, none of which is related to teaching quality, such as age, attractiveness of instructor, time of day, class size, etc.
 See Carrell & West 2010, Braga et al. 2014, and Stroebe 2016.
 See Pepper 2010 and Carrell & West 2010. For more varied results on the relationship between low evaluations and active learning, see Henderson 2018. An overview of some of these issues for teaching physics, but relevant to other disciplines, can be found here.
 While the empirical evidence is “decidedly mixed” (Peterson et al. 2019), there is undeniable evidence that biases are widespread. Among the resources listed in the Kreitzer bibliography noted above are several research papers that have found statistically negligible bias in their SETs, but these seem to be the exception rather than the rule. An overview of the wide range of biases that have been empirically studied in SETs can be found here (University of Dayton: https://www.udayton.edu/ltc/set/faculty/bias.php).
 Link: https://docs.google.com/document/d/14JiF-fT–F3Qaefjv2jMRFRWUS8TaaT9JjbYke1fgxE/mobilebasic. A different annotated bibliography, updated by Danica Savonick and Cathy N. Davidson, can also be found here: https://www.hastac.org/blogs/superadmin/2015/01/26/gender-bias-academe-annotated-bibliography-important-recent-studies.
 I am grateful to Lisa Berry for informing me of Brookfield’s body of work.
 Brookfield first proposed this model in 1995 in Becoming a Critically Reflective Teacher (2nd edition published in 2005). There are numerous online resources summarizing the principal arguments; here is one [https://www.learning.ox.ac.uk/media/global/wwwadminoxacuk/localsites/oxfordlearninginstitute/documents/supportresources/lecturersteachingstaff/resources/resources/CriticallyReflectiveTeaching.pdf].
 Admittedly, this may seem objectionable to some because it appears like tampering with student opinions of the course. But this approach is modeled on training students to give useful feedback on peer-reviewed papers; students need practice and need to receive feedback to best learn how to be effectively critical.
 I will forever remain indebted to my university Writing Program which offered formal training in pedagogy, ultimately leading to working with our school’s Instructional Development as a consultant.
 A personal example: While training to teach a first-year composition and rhetoric course, I was given a reading that distinguished between “boundary crosser” students and “boundary guarder” students as pertaining to how they accessed and made use of prior genre knowledge. This distinction proved helpful in giving me a conceptual “handle” to understand my experience with several students and a common vocabulary with my peers to discuss different approaches to these students.
 Noted (with references) in Brockx et al. 2012: 1123.
 Perhaps the most cited internet resource to demonstrate bias in written commentary is the Gendered Language in Teacher Reviews, run by Ben Schmidt. The site aggregates data from RateMyProfessor.com and allows users to sort data by keywords.
 Brockx et al. 2012
 One recent study (Peterson et al. 2019) has shown, however, that by explaining to students the implicit race and gender biases found in SETs, those biases were significantly mitigated in the evaluations (in comparison to a control group).
Here is the anti-bias language that was used in the experiment: “Student evaluations of teaching play an important role in the review of faculty. Your opinions influence the review of instructors that takes place every year. Iowa State University recognizes that student evaluations of teaching are often influenced by students’ unconscious and unintentional [bolded in original] biases about the race and gender of the instructor. Women and instructors of color are systematically rated lower in their teaching evaluations than white men, even when there are no actual differences in the instruction or in what students have learned. As you fill out the course evaluation please keep this in mind and make an effort to resist stereotypes about professors. Focus on your opinions about the content of the course (the assignments, the textbook, the in-class material) and not unrelated matters (the instructor’s appearance).” Much more research needs to be done exploring this deeply important issue.
- Boring, Anne, Ottoboni, Kellie & Stark, Philip B. 2016. “Student Evaluations of Teaching (Mostly) Do Not Measure Teaching Effectiveness.” ScienceOpen Research. [DOI: 10.14293/S2199-1006.1.SOR-EDU.AETBZC.v1]
- Braga, Michela, Paccagnella, Marco & Pellizzari, Michele. 2014. “Evaluating Students’ Evaluations of Professors.” Economics of Education Review, Vol. 41, pp. 71.
- Brockx, Bert, Van Roy, K. & Mortelmans, Dimitri. 2012. “The Student as a Commentator: Students’ Comments in Student Evaluations of Teaching.” Procedia – Social and Behavioral Sciences, Vol. 69, pp. 1122-1133.
- Carrell, Scott E. & West, James E. 2010. “Does Professor Quality Matter? Evidence from Random Assignment of Students to Professors.” Journal of Political Economy, Vol. 118, No. 3, pp. 409-32.
- Griffin, B.W. 2004. “Grading Leniency, Grade Discrepancy, and Student Ratings of Instruction.” Contemporary Educational Psychology, Vol. 29, pp. 410–25.
- Gump, S.E. 2007. “Student Evaluations of Teaching Effectiveness and the Leniency Hypothesis: A Literature Review.” Educational Research Quarterly, Vol. 30, pp. 56–69.
- Henderson, Charles, Khan, Raquib & Dancy, Melissa. 2018. “Will My Student Evaluations Decrease if I Adopt an Active Learning Instructional Strategy?” American Journal of Physics, Vol. 86, p. 934. [DOI: 10.1119/1.5065907]
- Låg, Torstein & Sæle, Rannveig Grøm. 2019. “Does the Flipped Classroom Improve Student Learning and Satisfaction? A Systematic Review and Meta-Analysis.” AERA Open. [DOI: 10.1177/2332858419870489]
- Linse, Angela R. 2017. “Interpreting and Using Student Ratings Data: Guidance for Faculty Serving as Administrators and on Evaluation Committees.” Studies in Educational Evaluation, Vol. 54, pp. 94-106.
- MacNell L, Driscoll, A. & Hunt A.N. 2015. “What’s in a Name? Exposing Gender Bias in Student Ratings of Teaching.” Innovative Higher Education, Vol. 40, No. 4, pp. 291–303.
- Marsh, H.W. & Roche, L.A. 2000. “Effects of Grading Leniency and Low Workload on Students’ Evaluations of Teaching: Popular Myth, Bias, Validity, or Innocent Bystanders?” Journal of Educational Psychology, Vol. 92, pp. 202–28.
- Peterson, David A. M., Biederman, Lori A., Andersen, David, Ditonto, Tessa M. & Roe, Kevin. 2019. “Mitigating Gender Bias in Student Evaluations of Teaching.” PLoS ONE, Vol. 14, No. 5. [DOI: 10.1371/journal.pone.0216241]
- Rowan S., Newness E.J., Tetradis S., Prasad J.L., Ko C.C. & Sanchez A. 2017. “Should Student Evaluation of Teaching Play a Significant Role in the Formal Assessment of Dental Faculty? Two Viewpoints: Viewpoint 1: Formal Faculty Assessment Should Include Student Evaluation of Teaching and Viewpoint 2: Student Evaluation of Teaching Should Not Be Part of Formal Faculty Assessment.” Journal of Dental Education, Vol. 81, pp. 1362-72.
- Stroebe, Wolfgang. 2016. “Why Good Teaching Evaluations May Reward Bad Teaching: On Grade Inflation and Other Unintended Consequences of Student Evaluations.” Perspectives on Psychological Science, Vol. 11, No. 6, pp. 800-16.
- Uttl, Bob, White, Carmela A. & Gonzalez, Daniela Wong. 2017. “Meta-Analysis of Faculty’s Teaching Effectiveness: Student Evaluation of Teaching Ratings and Student Learning are Not Related.” Studies in Educational Evaluation, Vol. 54, pp. 22-42.