Overview
Understanding student ratings requires an understanding of statistical concepts
related to sampling, significance, and precision, as well as an understanding
of the characteristics of ratings as a measure of teaching performance. Because
student ratings statistics do not have the precision typical of statistics in
the sciences, it is always important to interpret them in the context of individual
and unit patterns. OIRPS offers workshops and consultation on interpreting these
statistics and using TCE results appropriately, as well as on other aspects
of evaluating teaching.
The primary units of analysis in TCE reports are individual student responses
within individual sections. Many reports also show summaries of results for
the same questions from sets of similar courses. This section describes the
statistics used in the reports and offers suggestions for their interpretation,
along with information about general characteristics of student ratings.
OIRPS recommends a three-step procedure for reviewing TCE reports:
Step 1: Check the sample
Step 2: Review individual results
Step 3: Review comparison statistics
Checking the Sample
Before using ratings, it is important to know how representative the available
data are. Standards for samples depend on how the ratings will be used: they
should be most stringent when ratings are used in performance review.
Sample Quality within Sections
For responses in a section to be meaningful for decision-making purposes, they
must be representative of the entire class. Information about the sample is
printed at the top of each TCE report: number enrolled, number responding, and
percent responding. Use Table 1 to decide whether enough students responded
for the sample to be meaningful.
The higher the proportion of respondents to those enrolled, the more representative
the results. In general, sections with less than a 50% response rate should
not be used for performance appraisal. The smaller the class, the higher the
percentage of responses needed to ensure that the sample is representative.
If the non-response rate seems high, there may be a systematic reason for student
absence that might bias results. For example, if ratings are administered the
day of a review session when attendance is optional, students for whom instruction
has been most effective may be excluded.
If only a small fraction of students respond, the responses can only be considered
the opinions of those few students – even though it may be tempting to
generalize if they are positive.
Table 1: Guidelines for Judging Samples Within Sections
| Class size | Recommended response % |
| 5-20 | at least 80%, more recommended |
| 20-30 | at least 75%, more recommended |
| 30-50 | at least 66%, 75% or more recommended |
| 50-100 | at least 60%, 75% or more recommended |
| 100 or more | more than 50%, 75% or more recommended |
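The check described by Table 1 can be sketched in a few lines of Python. This is only an illustration, not an OIRPS tool: the function names are hypothetical, and because adjacent rows of the table share an endpoint (a class of 20 appears in two rows), the sketch assumes the stricter, smaller-class row applies at the boundary.

```python
# Thresholds transcribed from Table 1: (largest class size in row, minimum response %).
# Assumption: a class size that sits on a shared boundary (20, 30, 50, 100)
# is judged by the stricter row for smaller classes.
TABLE_1 = [
    (20, 80.0),
    (30, 75.0),
    (50, 66.0),
    (100, 60.0),
]
LARGE_CLASS_MINIMUM = 50.0  # more than 100 enrolled: response must exceed 50%


def minimum_response_percent(enrolled):
    """Return the Table 1 minimum response %, or None below the table's range."""
    if enrolled < 5:
        return None  # Table 1 gives no guideline for classes under 5
    for largest_size, minimum in TABLE_1:
        if enrolled <= largest_size:
            return minimum
    return LARGE_CLASS_MINIMUM


def sample_is_adequate(enrolled, responded):
    """True if the response rate meets the Table 1 minimum for this class size."""
    minimum = minimum_response_percent(enrolled)
    if minimum is None:
        return False
    percent = 100.0 * responded / enrolled
    if enrolled > 100:
        return percent > minimum  # the last row says "more than 50%"
    return percent >= minimum


print(sample_is_adequate(25, 19))    # True: 76% against a 75% minimum
print(sample_is_adequate(200, 100))  # False: exactly 50% is not "more than 50%"
```

Meeting the minimum does not by itself make a sample representative; the cautions above about systematic reasons for non-response still apply.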
While the results from a single administration of a TCE questionnaire, particularly
a long questionnaire, can provide useful information, such results apply to
the course as one event in time only. Averaged results from comparable courses
taken over several evaluations (each with an adequate sample of responses) are
more likely to fairly represent teaching ability. A minimum of five courses
is recommended. It is also important to ensure that the courses selected are
representative. If an instructor’s teaching load is half graduate courses
and half undergraduate courses, the sample presented for review should be about
half graduate and half undergraduate courses. Most importantly, no single score
or set of scores from a single section should be used to judge teaching performance
in a performance appraisal.
Sample Quality of Comparison Groups
Questions to ask about comparison groups include:
1) Are the courses in the comparison group reasonably comparable in content,
size, and instructional methods?
2) Are there enough courses in the comparison group?
3) Were a substantial number of courses that met the selection criteria for
the comparison group not included because their instructors did not participate
or because insufficient student response, lack of documented student monitoring,
or other errors invalidated the data?
4) How many different instructors taught courses included in the comparison
group?
Reviewing the Section Results
Frequencies and Percent of Valid Responses
For each question, the distribution of student responses across the possible
response choices is given in frequency of responses per option and percent of
valid responses per option. Interpreting the data is largely common sense –
how many students "said" what, in terms of the available response
options for each question. Usually, students are in fairly good agreement in
their ratings and scores cluster around two or three adjacent options.
For positively-stated questions concerning effective teaching, it is desirable
for responses to cluster in the first two options, "almost always"
and "usually." If a substantial percentage of students respond "sometimes,"
"rarely," or "almost never," the question points to an area
of teaching skill that likely needs attention. Responses should cluster similarly
for questions with response scales worded "very useful" to "nearly
useless."
For questions with normatively worded response options such as "among the
best" to "among the worst," more caution is needed, as the basis
for comparison is unknown. For example, if a student has taken only exceptionally
well-taught courses, a moderately well-taught course might seem poor by comparison.
Means, Medians, and Standard Deviations
Means and medians are measures of central tendency, showing the "middle"
of a set of scores. The standard deviation (SD) is a measure of how variable
scores are, i.e. how spread out they are around that "middle." Means
and SDs appear on all reports in both section data and comparison data. Medians
appear only in comparison data. Means, medians, and standard deviations are
in the same units as the original sample.
The mean for a question is the arithmetic average of student responses. For
most TCE questions, means can range from 1 to 5. Most questions are reverse
scaled: that is, the most positive option, "A," is scored as 5 points.
The "Key" on each question tells how individual questions were scored.
The SD gives an approximate measure of agreement or disagreement among raters.
Perfect agreement would yield an SD of 0. In a typical class, about two-thirds
of ratings fall within one rating point above or below the mean, and the SD is
1.0 or less. If the SD for a question on a 5-point scale is higher than 1.2,
the mean is not a good measure of student response.
High SDs occur when opinion in a class is strongly divided between very high
and very low ratings, or when opinion is dispersed across the entire response
scale. Because students and teachers vary, it is possible for a teacher to be
"among the best" for some and "among the worst" for others.
In such cases, the mean does not represent a "typical" student opinion
in any meaningful sense. Consultation to explore the source(s) of consistently
high SDs is available from OIRPS.
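To make the mean and SD concrete, the following Python sketch computes both from the frequency counts printed for a question. It is an illustration under stated assumptions rather than the calculation OIRPS uses: the function name and example counts are hypothetical, options are assumed to be scored 5 down to 1 in report order (the reverse scaling described above), and the SD is assumed to be the sample SD (dividing by n - 1).

```python
import math


def mean_and_sd(frequencies, points=(5, 4, 3, 2, 1)):
    """Mean and sample SD from counts of valid responses per option.

    frequencies: counts in report order, most positive option ("A") first.
    points: score assigned to each option (assumed 5 down to 1).
    """
    n = sum(frequencies)
    if n < 2:
        raise ValueError("need at least two valid responses")
    mean = sum(p * f for p, f in zip(points, frequencies)) / n
    # Assumption: sample variance (n - 1 in the denominator).
    variance = sum(f * (p - mean) ** 2 for p, f in zip(points, frequencies)) / (n - 1)
    return mean, math.sqrt(variance)


# Hypothetical class in good agreement: ratings cluster on two adjacent options.
m, sd = mean_and_sd([12, 10, 3, 0, 0])
print(f"{m:.2f} {sd:.2f}")  # 4.36 0.70

# Hypothetical divided class: ratings pile up at both ends of the scale.
m, sd = mean_and_sd([12, 1, 0, 1, 11])
print(f"{m:.2f} {sd:.2f}")  # 3.08 1.98
```

In the second example the mean of 3.08 describes almost no one in the class, and the SD of about 2 is well above the 1.2 guideline, which is the situation in which the frequency distribution is more informative than the mean.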
Confidence Intervals
Most OIRPS reports show a 95% confidence interval (CI) in parentheses to the
right of the section means and comparison group means. While the SD gives an
approximate measure of the amount of disagreement among students, the 95% CI
shows the impact of the disagreement on the precision of the mean as a way of
summarizing responses.
The 95% CI is similar to the "margin of error," a familiar feature
of opinion polls which assigns a value, plus or minus, within which the "true"
score occurs once all sources of error and disagreement are taken into account.
Informally, one can be 95% confident that the true score for a question lies
somewhere in the interval between the two values.
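The reports do not state how the interval is calculated, so the following Python sketch should be read as an assumption: it uses the common normal approximation, mean plus or minus 1.96 times SD divided by the square root of the number of respondents. The function name and figures are hypothetical, and for small classes a t-based interval would be somewhat wider than the one shown here.

```python
import math


def confidence_interval_95(mean, sd, n):
    """Approximate 95% CI for a mean (assumed normal approximation)."""
    if n < 2:
        raise ValueError("need at least two responses")
    margin = 1.96 * sd / math.sqrt(n)  # 1.96 is the normal critical value for 95%
    return mean - margin, mean + margin


# Hypothetical section: mean 4.36, SD 0.70, 25 respondents.
low, high = confidence_interval_95(4.36, 0.70, 25)
print(f"({low:.2f}, {high:.2f})")  # (4.09, 4.63)

# The same mean and SD from 100 respondents give a narrower interval.
low, high = confidence_interval_95(4.36, 0.70, 100)
print(f"({low:.2f}, {high:.2f})")  # (4.22, 4.50)
```

The comparison shows why the CI matters alongside the SD: the same level of disagreement among students yields a less precise mean when fewer students respond.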
Reviewing the Comparison Statistics
For spring 2000 and subsequent reports, comparison group statistics appear on
the final page along with one or more graphics showing how results for the section
compare with results for the comparison group. This page is titled “TCE
Comparison Report.” For reports issued prior to spring 2000, statistics
for the comparison group appear on the Short Report in the column labeled "Comparison
Group" (between the section statistics and the columns showing T scores
and Percentile Rank Groups (%Rank)).
Descriptive statistics for comparison groups include the number of sections
in the comparison group, the grand mean and its 95% CI, and the median of section
means for each question. A comparison group mean is the grand mean of a set
of section means, not the mean of student responses pooled across the sections.
Similarly, the comparison group SD is the standard deviation of the section means. The
median is the halfway point: half of all the means in the comparison group fall
above the median and the other half below.
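The difference between a grand mean of section means and a mean of pooled student responses is easiest to see with numbers. The Python sketch below uses three hypothetical sections of unequal size; the function name is illustrative, and the SD is assumed to be the sample SD of the section means.

```python
from statistics import mean, median, stdev


def comparison_group_stats(sections):
    """Summarize a comparison group from per-section lists of scores.

    Each section counts once, regardless of how many students responded.
    """
    section_means = [mean(scores) for scores in sections]
    return {
        "sections": len(section_means),
        "grand_mean": mean(section_means),
        "median": median(section_means),
        "sd": stdev(section_means),  # assumption: sample SD of section means
    }


# Hypothetical data: two small sections and one larger, lower-rated section.
sections = [
    [5, 5, 4, 4],   # section mean 4.5
    [3] * 12,       # section mean 3.0
    [4, 4, 4, 4],   # section mean 4.0
]

stats = comparison_group_stats(sections)
pooled = mean(score for scores in sections for score in scores)

print(f"{stats['grand_mean']:.2f}")  # 3.83
print(f"{pooled:.2f}")               # 3.50
print(stats["median"])               # 4.0
```

Because each section counts once, the large section does not dominate the grand mean the way it dominates the pooled mean. A section is therefore compared with other sections, not with the overall population of student responses.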
Systematic Variation in Ratings
Although properly administered student ratings are quite dependable, research
shows that there are predictable sources of systematic variation and bias which
should be considered when comparing scores. To address potential concern about
three factors known to cause systematic variation in ratings (disciplinary differences,
course level and course size), we have based our comparison groups on these
variables. As our database grows, other factors may be taken into account. However,
research shows that taken together, all the sources of variation listed typically
account for less than 5% of variation in overall instructor ratings.
Factors Likely to Cause Systematic Variation in Ratings
1. Disciplinary Differences
Significant differences between ratings of courses in different disciplines
are well documented. For example, courses in the humanities and fine arts tend
to be rated more highly than those in physical and applied sciences. For this
reason, most sources agree that ratings should not be compared across disciplines.
(If cross-disciplinary comparisons of faculty are necessary, faculty standings
within their own comparison groups can be compared.) Unless faculty have recommended
combining similar subject areas, our reports always restrict comparisons to
the subject area defined by the course subject code, e.g., ANTH, MUSI, POL,
etc.
2. Course Level
Lower division students tend to give the lowest ratings; graduate students tend
to give the highest ratings.
3. Class Size
Small classes (fewer than 20 students) tend to receive the highest ratings,
whereas large classes (40-100) tend to receive the lowest ratings. Classes of
more than 100 students tend to receive intermediate ratings, which suggests
that students may have different criteria for evaluating them.
4. Course Status
Students tend to give electives and courses in their majors slightly higher
ratings than courses taken to fulfill a college or general education requirement.
5. Semester or Summer Session
Summer Session ratings, on average, are significantly higher than fall or spring
ratings for comparable courses at UA. Thus, unless otherwise noted, comparison
groups do not include Summer Session data.
6. Course Content
Differences in ratings are occasionally associated with course content. For
example, courses with quantitative content may receive slightly lower ratings
than other courses at the same level in the same subject area. Similarly, courses
that challenge strongly held beliefs may receive lower ratings from some students.
7. Years of Teaching Experience
Instructors with less than one year of experience tend to receive the poorest
ratings. Teachers with between three and twelve years of experience tend to receive
the best ratings, while those with more than twelve years tend to receive intermediate
ratings.
8. Improper Administration of Questionnaires
Student ratings can be biased by failure to adhere to instructions for administering
the questionnaire, such as failure of the instructor to leave the room during
administration, failure to preserve student anonymity, administration of the
evaluation during finals, and use of prejudicial introductory remarks. (The
TCE monitoring system is a strategy to minimize such problems.)
Factors That Have Little Influence on Ratings
1. Scheduling Factors
Time of day and other scheduling factors appear to have little or no influence
on ratings. However, systematic differences in who attends classes at particular
times could theoretically have some impact on ratings.
2. Students’ Academic Ability
Academic ability, as measured by grade point average, has little relationship
to student ratings. Evidently, poor students are just as appreciative of good
teaching as good students, while good students are just as critical of poor
teaching as less able students. However, when there is great variety in students’
prior learning and abilities in a course, the instructor may end up concentrating
on one group of students to the exclusion of others. In such a situation, the
actual quality of teaching varies within the class and will probably be reflected
in the ratings.
3. Gender
Researchers looking for correlations between ratings and gender have found significant
variation, but in both directions. That is, some studies show female faculty
receiving higher ratings while others show male faculty receiving higher ratings.
In either case, the differences are typically trivial, accounting for less than
2% of the variation in ratings. Female students tend to give slightly higher
ratings than male students and some studies have found correlations based on
whether student and teacher gender are the same. At UA, female instructors tend
to receive higher ratings in most subject areas. If you suspect a systematic
pattern of gender bias in ratings for a particular course, please contact OIRPS.
4. Perceived Difficulty, Workload, and Expected Grades
The relationship between grades and ratings is complex. The preponderance of
research evidence shows a very small positive correlation between ratings and
expected grades. There is also some evidence that students will tend to give
lower ratings when they expect grades lower than they usually get in other courses.
A meta-analysis (Cohen, 1981) explored the relationship between overall instructor
ratings and student achievement as measured by scores on an independently-graded
final exam in multiple sections of the same class taught by different instructors.
Cohen found that students who received high scores on the final tended to rate
their instructors highly (regardless of the instructor), suggesting that successful
students tend to credit their instructors for their success.
Centra, J.A. (1975). Colleagues as raters of classroom instruction. Journal
of Higher Education, 46: 327-337.
Cohen, P.A. (1981). Student ratings of instruction and achievement: A meta-analysis
of multisection validity studies. Review of Educational Research, 51: 281-309.