Decisions about promotion and tenure in political science departments include an evaluation of teaching effectiveness. Although some universities have moved beyond sole reliance on student evaluations of teaching (SETs), they remain a core part of the teaching dossier. Many female faculty members believe that they face prejudice in SETs. However, skepticism remains about the existence or degree of gender bias in SETs. Historically, systematic studies of SETs were mixed in their findings of gender bias; however, newer and more rigorous studies show an emerging consensus that gender bias does exist. This article builds on the broad body of work on gender bias in SETs to extend these findings to political science departments and to introduce a new argument about the interaction between instructor gender and class size.
This article presents a number of interrelated arguments. Increasingly, the literature suggests that female instructors receive lower rankings than male instructors across a range of disciplines. In a twist on this research, I argue that the effect of an instructor’s gender should be dependent on the size of the course. My review of the literature on gender and leadership assessments suggests that there should be an interaction between the gender of the instructor and student assumptions about leadership roles. Thus, when a course requires that a teacher take on a stereotypical leader role—such as a large lecture course—assumptions about gender roles could have a significant impact on evaluations. I provide an empirical assessment of the hypothesis about an interaction between class size and gender bias using publicly available SET data from two political science departments at large public universities. These data show, as expected, that female faculty members receive lower evaluations of general teaching effectiveness in large courses than male faculty members, whereas there is no substantial difference for small courses. To the extent that teaching evaluations are an important part of promotion and compensation decisions and other reward systems within universities, reliance on SETs that appear to be biased creates concerns. These concerns suggest that the discipline must reconsider its methods of faculty evaluations and the role that they have in professional advancement.
The first section of the article discusses the general literature on gender bias in SETs. The second section turns to theory, arguing that role-incongruity theory strongly indicates that there should be an interaction between the degree of gender bias and class size. The third section presents empirical evidence from two political science departments and concludes by drawing implications for the use of SETs in processes of professional advancement and reward.
GENDER BIAS IN EVALUATION OF TEACHING EFFECTIVENESS
The potential for gender bias in SETs has long been recognized and discussed. This section summarizes the general literature on gender and SETs and the more limited work on this relationship in the political science discipline. The role of class size is rarely mentioned in these studies. It is worth noting, first, that studies of possible gender bias in SETs in higher education began appearing in the 1980s and 1990s, and early findings were mixed (e.g., Basow and Silberg 1987; Centra and Gaubatz 2000; Feldman 1993; Sidanius and Crane 1989).
However, recent and more rigorous studies show consistent evidence of bias. These studies are based on both experiments and observational analysis. Arbuckle and Williams (2003) undertook a fascinating experiment in which students viewed a stick figure that delivered a short lecture. All participants observed the same stick figure and the same lecture, but the figures were given labels of old or young and male or female. Participants rated the figure labeled as a young male as significantly more expressive than the others, which illustrates that students’ expectations influence their perception of an instructor independent of the material or how it is delivered. A similar experimental setup in a distance-education course allowed researchers to manipulate whether a male or female instructor was teaching the course and whether students believed that the instructor was male or female (MacNell, Driscoll, and Hunt 2014). The authors found that “the male identity received significantly higher scores on professionalism, promptness, fairness, respectfulness, enthusiasm, giving praise, and the student ratings index” (MacNell, Driscoll, and Hunt 2014, 8), regardless of whether the instructor was actually male or female. One particularly striking finding in this study was that even relatively objective questions, such as whether the instructor was prompt, led students to score the instructor almost one point lower on a five-point scale if they believed that the instructor was female. This finding suggests that the fault of SETs is not in the way that questions are posed or which qualities they ask about; rather, the fault lies in the nature of the instrument itself.
Other recent work relies on observational rather than experimental techniques. Miller and Chamberlin (2000) focused on students’ perception of instructor educational credentials and found that they perceive male instructors as having higher or superior credentials. In a recent study undertaken in an Italian engineering college, Bianchini, Lissoni, and Pezzoni (2012) found that in three of the four programs they examined, women consistently received significantly lower effectiveness scores than men. The authors speculated that the gender composition of the student body could account for their findings because two of the four programs had low percentages of female students.
In an especially well-designed observational study, Boring (2015) compiled more than 22,000 observations of student ratings in a French school of social science. She examined mandatory introductory classes in which students’ ability to choose their instructor is tightly constrained. The courses include a standard final examination that is graded anonymously, which provides an independent, objective measure of student learning. The numerous observations allowed Boring to control for both student and teacher fixed effects. Together, these features allowed for a more reliable measure of bias, representing a major improvement on other observational studies. They allowed Boring not only to measure the degree of gender bias in SETs but also to explore its roots and whether instructor ratings are a good indicator of teaching effectiveness.
Boring’s results are striking. She found that male instructors receive significantly higher ratings, which results from a strong male-student bias. Male students are 30% more likely to give a rating of “excellent” to male than to female teachers (Boring 2015, 5). Female instructors scored relatively well in more time-consuming tasks, such as course preparation, whereas male instructors scored well in less time-consuming activities, such as leadership skills. Boring also found that students who receive higher grades give higher instructor ratings, and she calculated that women could receive the same rating as men if they gave students a 7.5% boost in their grades (Boring 2015, 2). Because Boring used the final exam as an independent measure of student learning, she could explore the degree to which student performance is correlated with higher teacher ratings. She found that it is not correlated and that “SET scores do not seem to measure actual teaching effectiveness” (Boring 2015, 2).
Within political science, the APSA has occasionally published a piece in PS that draws attention to the potential for bias in SETs, and it offers advice for concerned faculty. Langbein (1994) noted that the effect of low grades on teaching evaluations is more pronounced for female than male faculty. Noting that poor evaluations can have negative effects on promotion and compensation decisions, Langbein questioned whether SETs are adequately valid measures of teaching effectiveness to have such an important role. Andersen and Miller (1997) noted that female instructors who are not perceived as caring and accessible may fail to meet student expectations and therefore may be penalized on SETs. Sampaio (2006) examined the intersection of gender, race, and subject matter, focusing on implications for women of color in the classroom. Dion (2008) reviewed the literature on bias and offered advice for women faculty who must be both authoritative and nurturing. In related work, Baldwin and Blattner (2003) suggested that because SETs may be biased, alternative evaluation measures should be considered. Smith (2012) noted that SETs are used for both professional development and employment decisions, setting up tensions. These tensions are especially pronounced, given questions about the validity and reliability of SETs as well as peer observation of teaching.
ROLE INCONGRUITY AND LEADERSHIP IN LARGE CLASSES
We can make more sense of studies of gender bias in SETs by turning to the psychology literature on role incongruity and leadership. A body of work known as “role-congruity theory” puts these studies of SETs in context and suggests more refined ways to approach the question of gender bias. The idea behind role-congruity theory is that individuals enter social interactions with implicit assumptions about the roles that others will play. Gender roles are prominent in this literature, with men implicitly associated with the “agentic” type: more assertive, ambitious, and authoritative. Women tend to be implicitly associated with the non-agentic type: more passive, nurturing, and sensitive. Role incongruity occurs when a man or a woman acts in a way that is contrary to type—for example, if a woman takes on an agentic demeanor. A situation that demands that a woman be agentic will cause role incongruity and can lead to negative reactions from students. I link this body of theory to SETs by noting that some class settings demand a more agentic approach than others. Small seminars allow for extensive one-on-one interaction and the ability to establish empathy while still demonstrating mastery of the material. However, in large lecture courses, the opportunities to exhibit sensitivity to individual students are more limited. At the same time, these “sage-on-a-stage” formats demand that the instructor be assertive and demonstrate consistent authority.
Although the literature on role congruity and leadership is extensive, I summarize the studies linked most directly to my focus on SETs. Butler and Geis (1990) used experimental approaches to examine the role of gender and leadership in the reactions of observers. They focused on nonverbal responses, in particular the positive or negative facial reactions of participants who observed leaders making suggestions for certain courses of action. Female leaders elicited significantly more negative facial expressions than males in the same situation. Ridgeway (2001) discussed “gender status beliefs” and how they constrain individuals’ expectations of leaders. Gender status beliefs lead individuals to assume that men will be more competent and assertive as leaders. Experiments that test these ideas reveal that when women are placed in a leadership role and act assertively, they are punished. Rudman and Glick (2001) also examined the potential for backlash against agentic women. They found that women who violate stereotypes by exhibiting intelligence, ambition, and assertiveness elicit negative reactions. However, this effect can be mitigated if women “temper their agency with niceness” (Rudman and Glick 2001, 743).
In their review of the work on role-congruity theory and female leadership, Eagly and Karau (2002) found that two forms of prejudice are most prominent. First, women are generally viewed less favorably as leaders. Second, when women exhibit behaviors that are associated with leadership (e.g., projecting authority), they are evaluated less favorably than men. In a novel multimethod approach, Johnson et al. (2008) conducted a series of tests of role-congruity theory using qualitative, experimental, and survey approaches. They contrasted the “strong” (agentic) type to the “sensitive” (non-agentic) type. Consistent with other studies, they found that female leaders must project both strength and sensitivity to be effective, whereas male leaders need only project strength.
Taken as a whole, these studies argue for a more nuanced approach to the potential for gender bias in SETs. Different types of courses demand that instructors assume different roles. In small classes (e.g., seminars), the instructors usually are seated and their role is to guide discussion and draw out students’ thoughts. In this setting, students likely do not come to class with expectations that the instructor will play the typical agentic-leader role. Contrast this with a large lecture course, in which the instructor is on a stage with a microphone speaking in front of hundreds of students. There, the opportunities to interact with individual students, to express concern for their specific needs, and to draw out their opinions are limited. Instead, students are likely to come to class with standard expectations of agentic leadership.
If this is the case, the potential for backlash against agentic women will be significant in large lecture settings, whereas it is likely to be minimal or absent in small class settings. Ratings for female instructors should therefore decline with class size at a higher rate than ratings for male instructors. This logic leads to the following hypothesis.
Hypothesis 1: The interactive effect between male gender and class size on SETs will be positive.
Hypothesis 1 could explain why early studies did not consistently find gender bias in SETs. Perhaps these biases primarily arise when leadership expectations are invoked, that is, in large classes. If women tend disproportionately to teach smaller classes than men (perhaps because of negative feedback when they attempt large courses), the interaction between course size and instructor gender could lead to average effects of gender being washed out. If this hypothesis is correct, then we should observe an interaction effect in which lower effectiveness ratings for female faculty emerge as class size increases. The presence of such an effect would support the relevance of role-congruity theory to the classroom and renew concerns about reliance on SETs as measures of teaching effectiveness.
Whereas other types of interaction effects between gender and other course characteristics have received attention, this specific interaction between course size and instructor gender has not been studied in depth. One exception is Wigington, Tollefson, and Rodriguez (1989), who collected data involving 5,843 student evaluations at a midwestern university in the mid-1980s. The authors found that the expected effect did appear: “The interaction between sex and size was due to males having higher ratings than females in the larger classes…” (Wigington, Tollefson, and Rodriguez 1989, 339). This effect was reversed for small classes. Unfortunately, the authors did not pursue this result any further and it apparently has gotten lost in a general sense that “interactions matter.” More recently, in a study at a college of engineering, Johnson, Narayanan, and Sawaya (2013) found that female instructors receive lower ratings, as do larger classes. However, they did not examine the interaction between these two factors. The next section presents new evidence on the interaction between course size and instructor gender using data from political science departments.
EVIDENCE AND IMPLICATIONS
Today, only a few public universities make SET results publicly available. The following analysis is based on records from two political science departments in large, public research universities. One is a southern university, for which I have data from 2011 through 2014; the other is a western university, for which I have data from 2007 through 2013. Total enrollment is more than 58,000 at the southern university and more than 31,000 at the western university. Both are well-ranked R1 research universities with large political science departments. Both administer their evaluations online. I collected all evaluations from undergraduate courses taught by faculty during the years indicated. According to the universities’ own documentation, these evaluations are required for consideration during promotion and tenure reviews. The southern university requires that the tenure dossier include a “complete longitudinal summary” of SETs in tabular form. The western university’s guidelines are less precise but specify that SETs must be included as one of two forms of teaching evaluation. Therefore, these instruments have a direct impact on professional advancement at the two institutions.
To investigate the predicted interactive effect of gender and class size, I used Tobit analysis. This approach is appropriate because the data are censored at both the top and the bottom of the five-point scale. That is, even students who loved the class cannot give a score above five and those who hated it cannot give a score below one. Table 1 shows the results of Tobit analysis, examining the effect of gender, course size, and interaction between the two on average course evaluations.
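The estimation approach described above can be sketched as follows. This is a minimal illustration of a two-limit Tobit model with a gender-by-size interaction, fitted to simulated data rather than to the SET records analyzed in this article. The function names, the simulated coefficients, and the choice to measure class size in hundreds of students are all assumptions made for the example; it is not the code or the exact specification behind Table 1.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

LOWER, UPPER = 1.0, 5.0  # limits of the five-point rating scale


def design_matrix(male, size):
    """Columns: intercept, male indicator, class size, male x size."""
    male = np.asarray(male, dtype=float)
    size = np.asarray(size, dtype=float)
    return np.column_stack([np.ones_like(size), male, size, male * size])


def tobit_negloglik(params, X, y):
    """Negative log-likelihood of a Tobit model censored at LOWER and UPPER."""
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)  # optimizing log(sigma) keeps sigma positive
    mu = X @ beta
    at_low = y <= LOWER
    at_high = y >= UPPER
    inside = ~(at_low | at_high)
    ll = np.empty_like(y)
    # Censored observations contribute the probability mass beyond the limit.
    ll[at_low] = norm.logcdf((LOWER - mu[at_low]) / sigma)
    ll[at_high] = norm.logcdf((mu[at_high] - UPPER) / sigma)
    # Uncensored observations contribute the usual normal density.
    ll[inside] = norm.logpdf((y[inside] - mu[inside]) / sigma) - log_sigma
    return -ll.sum()


def fit_tobit(X, y):
    """Maximum-likelihood fit, started from ordinary least squares."""
    beta0, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma0 = np.std(y - X @ beta0)
    start = np.append(beta0, np.log(sigma0))
    result = minimize(tobit_negloglik, start, args=(X, y), method="BFGS")
    return result.x[:-1], float(np.exp(result.x[-1]))


def simulate(n=4000, seed=0):
    """Hypothetical data: ratings fall with size, less steeply for men."""
    rng = np.random.default_rng(seed)
    male = rng.integers(0, 2, size=n)
    size = rng.uniform(0.1, 3.0, size=n)  # class size in hundreds (assumed)
    latent = 4.3 - 0.30 * size + 0.25 * male * size + rng.normal(0.0, 0.5, size=n)
    return design_matrix(male, size), np.clip(latent, LOWER, UPPER)


if __name__ == "__main__":
    X, y = simulate()
    beta, sigma = fit_tobit(X, y)
    for name, value in zip(["intercept", "male", "size", "male x size"], beta):
        print(f"{name:12s} {value: .3f}")
    print(f"{'sigma':12s} {sigma: .3f}")
```

In this setup, Hypothesis 1 corresponds to a positive coefficient on the `male x size` column. Standard errors are omitted from the sketch; in practice they would come from the inverse Hessian of the log-likelihood or from a statistics package.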
For the southern university, the dependent variable in this analysis is the average response, on a five-point scale, to the statement: “Overall, this instructor was effective.” “Strongly agree” is equivalent to five points and “strongly disagree” is equivalent to one point. Analysis is based on all 309 faculty evaluations available on the university’s website for this time frame. Enrollment in courses was not available, so course size is estimated by the number of students who completed the evaluation. The western university also uses a five-point scale. The question asked is whether students “learned from the course.” Enrollment data are available for this university, and the dataset includes 587 evaluated courses.
For both universities, the evidence supports Hypothesis 1. The coefficients are in the expected direction, showing a positive interaction effect between a male instructor and a larger class. The results for the southern and western universities are statistically significant at the 0.10 and 0.05 levels, respectively. Table 2 and figure 1 summarize the estimated substantive effects.
For a small course with 10 students, there is little difference in ratings between male and female instructors. For a larger course with 100 students, a more sizeable difference emerges, with males scoring two tenths and one tenth of a point higher in the southern and the western universities, respectively. For courses approaching the largest in the sample (i.e., 200 students in the southern, 400 in the western), a substantial gap emerges, with male instructors scoring half a point higher. Given differences in course sizes, average evaluations, and wording of questions, the estimated effect of interaction between gender and course size is remarkably consistent across the two universities. Differences of this magnitude are large enough to capture the attention of promotion and tenure committees, award committees, and the like. For universities that offer even larger classes, the cumulative effect could be larger still. Although this particular study is based on only two universities, it is consistent with studies in other fields and with the theoretical literature on role incongruity. It shows a systematic and sizeable bias against female instructors in large courses.
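To make the arithmetic behind substantive effects of this kind concrete: in a model with a gender indicator, class size, and their product, the predicted male-minus-female difference in the latent rating is linear in class size. The short sketch below computes that difference. The two coefficient values are hypothetical placeholders chosen only to produce a pattern of similar shape; they are not the estimates reported in Table 1 or Table 2.

```python
def predicted_gap(beta_male, beta_interaction, class_size):
    """Male-minus-female difference in the latent rating at a given class size."""
    return beta_male + beta_interaction * class_size


def break_even_size(beta_male, beta_interaction):
    """Class size at which the predicted gap is zero (needs a nonzero interaction)."""
    return -beta_male / beta_interaction


if __name__ == "__main__":
    # Hypothetical coefficients, per student enrolled; NOT the article's estimates.
    beta_male, beta_interaction = -0.02, 0.0025
    for size in (10, 100, 200):
        gap = predicted_gap(beta_male, beta_interaction, size)
        print(f"class size {size:3d}: predicted gap {gap:.3f}")
    print("break-even size:", break_even_size(beta_male, beta_interaction))
```

Because a Tobit model is censored at the ends of the scale, the gap in expected observed ratings can be somewhat smaller than this latent-scale difference when predictions sit near the upper limit of five.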
What difference does this apparent bias make? Of course, it depends on institutional practice. The worst-case scenario includes exclusive or predominant reliance on SETs for assessment of teaching effectiveness; emphasis on success in teaching larger courses; and a prominent role for teaching evaluations in professional advancement. Whereas these conditions do not hold in all or perhaps even most political science departments, they are not uncommon. For example, decisions about retention of adjunct faculty often are based solely on SETs; therefore, individual careers are wholly dependent on this one apparently biased measure.
An immediate effect of bias is likely that women disproportionately teach smaller courses than men. This could result from several mechanisms: women self-selecting out of teaching large courses; departments channeling women into teaching smaller courses; and students selecting into lectures that are taught by men. I do not take a stance on what the causal mechanism is; however, to the extent that successful teaching of large classes provides material or other rewards within departments, any process that leaves women disproportionately teaching small classes is an impediment to professional advancement. In the datasets analyzed here, there is evidence of women systematically teaching smaller courses than men. The mean course size for female faculty at the southern university is 34 students; for male faculty, it is 51 students. In the western university, courses taught by female faculty have an average size of 91 students; in those taught by male faculty, it is 123 students. A two-sample t-test shows that these are statistically significant differences in mean course size.
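A comparison of mean course sizes like the one just described can be run as in the sketch below. The class sizes are simulated, with group means set near the southern university's reported 34 and 51 only for illustration; the group counts, the spread, and the use of Welch's unequal-variance version of the test are assumptions, since the article does not specify them.

```python
import numpy as np
from scipy import stats


def compare_mean_sizes(sizes_a, sizes_b):
    """Welch two-sample t-test on course sizes; returns means, t, and p."""
    result = stats.ttest_ind(sizes_a, sizes_b, equal_var=False)
    return (float(np.mean(sizes_a)), float(np.mean(sizes_b)),
            float(result.statistic), float(result.pvalue))


def simulated_sizes(mean, n, seed):
    """Hypothetical course sizes: normal draws, floored at five students."""
    rng = np.random.default_rng(seed)
    return np.clip(rng.normal(mean, 20.0, size=n), 5.0, None)


if __name__ == "__main__":
    female_sizes = simulated_sizes(34.0, 120, seed=1)  # made-up sample
    male_sizes = simulated_sizes(51.0, 190, seed=2)    # made-up sample
    mean_f, mean_m, t_stat, p_value = compare_mean_sizes(female_sizes, male_sizes)
    print(f"mean size, female-taught: {mean_f:.1f}")
    print(f"mean size, male-taught:   {mean_m:.1f}")
    print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```

A negative t statistic here indicates that the first group (courses taught by women, in this example) has the smaller mean course size.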
More than 30 years ago, Martin (1984) wrote that the “message to women faculty seems clear: if your institution bases personnel decisions on student evaluations, make sure your colleagues are aware of the possibility of sex bias” (Martin 1984, 492). Three decades later, we essentially use the same evaluation tools, and colleagues remain skeptical of the presence of gender bias. For evaluations of women faculty in large courses specifically, the evidence presented here, bolstered by studies in other disciplines, indicates that the bias is strong and must be considered by departments and universities.
Recent public debate about women’s professional advancement has fallen into a dichotomy between those who argue that ambitious women need to “lean in” and those who draw attention to structural and implicit biases that work against women’s success at the highest levels. This current debate has direct relevance to the topic of this article. Gender interacts with aspects of the classroom environment to influence SETs. In particular, when women assume a stereotypical leadership role, as in a large lecture course, beliefs about gender and leadership have an impact on evaluations of teaching effectiveness. The evidence presented in this article supports this hypothesis and questions the use of SETs in consideration of promotion, compensation, awards, prominent administrative positions, and similar tokens of professional success. As Boring (2015, 6–7) concluded: “[S]tudents are not evaluating teachers’ helpfulness in making them learn when they complete their evaluations…. And yet, universities continue to use this tool in a way that may hurt women (and probably other minorities as well, and men who do not correspond to students’ expectations in terms of gender stereotypes) in their academic careers.”
Regarding the lean-in versus structural impediments dichotomy, the literature so far has fallen heavily on the former. Publications in political science journals (as well as in other disciplines) offer advice on how female faculty can increase their scores on SETs. Women have reported engaging in tactics to show their sensitivity to student needs and to illustrate their “niceness.” Many also take steps to better project their authority and competence, such as by participating in acting workshops. They spend considerable time on course preparation and organization. Some of these steps increase actual teaching effectiveness. However, faculty members—male and female—acknowledge that SETs can be gamed, and they offer advice on how to do so. Therefore, we are all encouraged to take the existing evaluation system as given and to lean in.
Given increasing evidence on gender bias in SETs, it is time for the pendulum to swing in the other direction: away from telling women to lean in and to perform better within the current system and toward developing better metrics of teaching effectiveness. For example, when we consider teaching effectiveness for graduate courses, we might consider SETs. However, a far more persuasive and widely used indicator of whether a professor is effective in training graduate students is results: Do the professor’s students obtain good jobs and go on to become prominent figures in the profession? To the extent that we can move away from SETs as a sole or primary indicator of teaching effectiveness at the undergraduate level and emulate what we naturally do at the graduate level, our assessments would be more reliable. Some institutions have moved toward a process of peer review to complement SETs. Although this innovation makes some faculty uncomfortable, peer review by faculty members who are given advice on how to do it well could be a substantial improvement on the currently dominant system (Stark and Freishtat 2014). Evaluation by trained observers is another possibility, although it would require investment by universities.
It also is possible that in some settings, more objective measures of teaching success could be developed. If multiple sections of the same course are taught by different faculty, for example, it may be possible to ask students to engage in a form of standardized assessment of how much they have learned. Effectiveness in teaching large introductory courses could be measured by assessing how well students perform later in more advanced courses. One recent study examined such a setting, in which economics students at Bocconi University were randomly assigned in introductory economics courses (Braga, Paccagnella, and Pellizzari 2014). The authors found that, indeed, SETs are significantly correlated with success in more advanced courses, but in the wrong direction. That is, teachers who receive lower ratings produce students who go on to achieve higher grades in advanced classes.
Of course, none of these changes could be implemented immediately or without controversy. However, given the long-standing concerns about heavy reliance on SETs, theory that bolsters these concerns, and evidence of bias in SETs in political science, change is long overdue. Questions about how new assessment technologies might work are no excuse for continuing to rely on existing mechanisms that are known to be faulty. We have enough advice on how to lean in; it is time to make structural changes.
The work presented in this article was influenced by many conversations with colleagues. I particularly thank Eve Fine, Yoi Herrera, Bob Keohane, Helen Kinsella, Rose McDermott, Ryan Powers, and Barbara Walter. Any mistakes, of course, are solely my responsibility.