Rubrics are rating forms designed to capture evidence of a particular quality or construct. The quality of the measure obtained from a rubric depends on how well the rubric is designed; a poorly designed rubric yields confounded or inaccurate ratings. Quality also depends on the skill of the rater using the rubric. Raters must be trained and calibrated to use the rubric well, to ensure that ratings are accurate and consistent across all raters. Ratings made with rubrics cannot be benchmarked against national comparison groups or compared to ratings made by other rater groups.
Rubrics are a popular approach when the goal is largely developmental; rubrics are good pedagogical tools. Issues arise, however, when rubrics are used for summative assessment.
- Rubrics are very imprecise measures. Typically a good rubric will have three to five categories. Any more than that, and applying the rubric in practice falls apart. In this way rubrics are analogous to grades. The range F-A is really more than most of us can handle, and we often report 95% of our grades in the C-A range. We sense that overly fine distinctions in grading might be illusory because of the multiplicity of factors that go into giving a fair grade. Rubrics address this by attempting to focus our attention on only a single dimension, e.g. critical thinking, of what may go into grading. But even so, our minds are not nearly as capable of making the kinds of discriminations that a well-designed test can make. We need to keep it simple; 3-5 categories are plenty. With rubrics as with grades, there will be lots of “B” students.
- Construct validity is a concern with home-grown rubrics. Locally designed rubrics, committee-derived conceptualizations, and self-report scoring provide useful data, but they don’t provide the objectivity necessary to support assertions about program excellence. Does the group writing the rubric have a good grasp of the target construct? Is that group independent, fair-minded and strong enough not to be drawn into the “local meanings” pit? In other words, as related to critical thinking, does the rubric measure critical thinking as that concept is most widely understood, or does the rubric only reinforce a local meaning that is too heavily weighted toward one discipline or another and which does not connect well with what the larger world means by critical thinking?
- Reliability is a concern when untrained raters apply rubrics. Are those who will apply the rubric well trained in its use so that inter-rater reliability is achieved? Even if the rubric is good, the raters may apply it with such variability that the score an individual project receives can differ widely. National workshops time and again demonstrate how a group of faculty can rate the same student essay and yet, when all the ratings are pulled together, a bell-shaped curve results, not the clear consensus that was expected. This demonstrates the importance of training raters with paradigmatic examples and practicing before doing the actual ratings. A variation of the reliability problem occurs when faculty rate the work of their own students, except here the strong tendency is to give higher ratings than people from other departments might assign.
- Confounding the target with other things is a concern when applying rubrics. The problem of reliability in the application of the rubric is matched by the problem of the validity of the application, meaning there is a tendency with rubrics to forget what we are supposed to be evaluating. Instead of looking only at the critical thinking, for example, raters may also mix in their impressions of the writing style or the content knowledge. They may also be unable to give due critical thinking credit for things like irony or satire. So a Jon Stewart editorial might get a low score on a critical thinking rubric because the raters did not dig deep enough to see the arguments that undergird his satire.
The validity and reliability of rubrics (rating forms) are judged by the Kappa statistic, and as a result a rubric is a potentially weaker measure of critical thinking than the validated standardized instruments available through Insight Assessment.
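As a rough illustration of how a Kappa statistic summarizes rater agreement, the sketch below computes Cohen's kappa for two raters who scored the same set of student projects on a four-point rubric. It assumes the two-rater (Cohen) form of the statistic, which the text above does not specify; the function name `cohen_kappa` and the ratings are made up for the example.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters scoring the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed share of
    items on which the raters agree and p_e is the agreement expected by
    chance from each rater's own category frequencies.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must score the same non-empty set of items")

    n = len(ratings_a)
    # Number of items on which the two raters gave the same category.
    agreements = sum(a == b for a, b in zip(ratings_a, ratings_b))

    # Chance agreement, kept as integer counts: sum over categories of
    # (times rater A used it) * (times rater B used it).
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    chance = sum(counts_a[c] * counts_b[c] for c in counts_a)

    if chance == n * n:
        # Both raters used one identical category for every item.
        return 1.0

    # Same as (p_o - p_e) / (1 - p_e), with numerator and denominator
    # multiplied by n * n to avoid rounding error.
    return (n * agreements - chance) / (n * n - chance)


# Hypothetical rubric scores (1 to 4) from two raters on ten projects.
rater_a = [3, 3, 2, 4, 3, 2, 3, 4, 3, 3]
rater_b = [3, 2, 2, 4, 3, 3, 3, 4, 2, 3]
print(cohen_kappa(rater_a, rater_b))
```

In this made-up data the raters agree on 7 of 10 projects, but because both give a "3" most of the time, a good share of that agreement would be expected by chance, and kappa comes out well below the raw 70% agreement. This is the same pattern described above: with few categories and many "B" ratings, raw agreement can overstate how consistent the raters really are.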