The Difference of Preference versus Performance Can Differ for Concurrent versus Retrospective Ratings

Srinivas Raghavan
Qualcomm Inc.,
5775 Morehouse Dr.
San Diego, CA 92121.
raghavan@qualcomm.com

Gary Perlman
OCLC Online Computer Library Center
6565 Frantz Road
Dublin, OH 43017
perlman@oclc.org

This paper appeared in the proceedings of HFES 2000, the 44th Annual Meeting of the Human Factors and Ergonomics Society, July 30 - August 6, 2000, San Diego, California, USA.

Abstract

Several studies have found differences between subjective preference ratings and objective performance measures. Bailey [Bailey, 1993] summarizes several and argues for separate treatment of these concepts. Our results in a multifactor multivariate experiment support Bailey's contention, but add a new dimension of concern: the use of concurrent versus retrospective subjective ratings. The presentation here focuses on the relationship of performance to concurrent versus retrospective preference ratings. Retrospective ratings may represent users' lasting impressions of a system after a trial use, but may not be good predictors of performance. Concurrent ratings of confidence of accuracy were found, in this study, to be better predictors of performance. We offer recommendations about how to make the best use of these different evaluation measures, particularly when they differ.

Introduction

The results reported here are a serendipitous set of observations made while researching model-based linking in hypertext. The research explored the idea that an entity-relationship model of data could be used to automatically generate useful hypertext links between related information [Raghavan, 1998]. In that research, links were generated based on entity relationships and evaluated, in online and paper forms, against search/index-based access methods.

Both objective (performance) and subjective (preference rating) measures can be used to study differences in the use of relationship links to seek information. Subjective preference ratings can be collected concurrently with the performance of the tasks or retrospectively (e.g., through a post-test questionnaire or interview).

In this paper we report the results of our study, which shows that the difference between preference and performance can itself differ for concurrent versus retrospective ratings.

Why study ratings of performance when you can use the performance measures themselves? Many products are released without any performance measures, and with only very informal subjective impressions, such as those from a focus group or, worse yet, from management/marketing. Collecting subjective measures systematically is a way to estimate that biased influence. Besides, if performance does not match preference, the design choices might need to be explained more carefully than when the two agree.


Experiment

An experiment evaluated the usefulness of entity-relationship-based links in accessing online versus print information. Subjects sought answers to questions in each of four comparison conditions:

  1. Linked-Online (LO): web-based tool implementing entity-relationship-based links
  2. Unlinked-Online (UO): web-based keyword search
  3. Linked-Paper (LP): cross-referenced documents
  4. Unlinked-Paper (UP): indexed documents

The performance of subjects answering the questions was assessed with three dependent variables:

  1. Task Completion (Actual) Time (AT)
  2. Actual Accuracy (AA)
  3. Concurrent Confidence Ratings of Accuracy (CA)

A post-test questionnaire gathered subjective (retrospective) ratings for each condition:

  1. Retrospective Speed (RS)
  2. Retrospective Accuracy (RA)
  3. Retrospective Usability (RU)

Results

All results have been "normalized" to be on comparable scales of measure. Predicted time was based on "designer's intuition". Although potentially biased in the designer's favor, it is worth presenting here as a contrast to the objective (actual time) and subjective (retrospective speed) measures.
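The paper does not specify which normalization was used, only that the measures were put on comparable scales. One common choice, shown here purely as an illustrative sketch (not the authors' actual procedure), is a linear rescaling of each measure onto a common range:

```python
def rescale(values, lo=0.0, hi=10.0):
    """Linearly map raw scores onto a common [lo, hi] scale.

    Illustrative assumption only: the paper states that measures were
    "normalized" to comparable scales but does not give the transformation.
    """
    vmin, vmax = min(values), max(values)
    if vmax == vmin:  # degenerate case: all values equal
        return [lo for _ in values]
    return [lo + (hi - lo) * (v - vmin) / (vmax - vmin) for v in values]

# Example: the Actual Time condition means (seconds) from Table 1,
# in the order LO, UO, LP, UP, mapped onto a 0-10 scale.
times = [83.3, 106.6, 121.3, 104.6]
scaled = rescale(times)
```

Any monotone rescaling of this kind preserves the ordering of conditions, so comparisons across measures remain meaningful.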

Actual time (in seconds) correlated poorly with Predicted time, but correlated well (r=0.49) with concurrent ratings of accuracy (low times were accompanied by high ratings of accuracy). Concurrent ratings of accuracy (which showed significantly lower confidence for Linked-Paper) correlated well with Actual Accuracy, and subjects were significantly more confident when correct (F(1,510)=87.4, p<0.001). Actual Accuracy varied little across conditions (F(3,45)=0.1). Concurrent ratings of accuracy correlated well with retrospective ratings of accuracy (r=0.62). However, Actual Accuracy did not correlate well with retrospective ratings of accuracy (r=-0.36). All retrospective ratings were intercorrelated (r=0.65, r=0.32, r=0.55), and these correlated best with the designer's predicted times (e.g., for speed ratings, r=-0.44).
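The correlations reported above are Pearson product-moment coefficients. As a self-contained reminder of what such an r value measures (this is a generic sketch, not the authors' analysis code), the coefficient can be computed directly from two paired samples:

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two paired samples.

    r = cov(x, y) / (sd(x) * sd(y)), ranging from -1 (perfect inverse
    linear relation) through 0 (no linear relation) to +1 (perfect
    direct linear relation).
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A value such as r=0.49 between actual time and concurrent accuracy ratings thus indicates a moderate linear relationship, not a deterministic one.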

The results are summarized in Table 1.


Table 1. Summary of objective and subjective (concurrent & retrospective) measures across the four major conditions: Linked Online (LO), Unlinked Online (UO), Linked Paper (LP), and Unlinked Paper (UP). Standard errors appear in parentheses.

                                   Online                       Paper
                           Linked (LO)  Unlinked (UO)   Linked (LP)  Unlinked (UP)
  Predicted Time (PT)           <             <              <
  Objective
    Actual Time (AT), secs   83.3 (6.2)  106.6 (9.3)    121.3 (8.2)  104.6 (7.7)
    Actual Accuracy (AA), %  77.3 (3.7)   78.1 (3.8)     75.8 (3.7)   76.6 (3.8)
  Subjective
    Concurrent Accuracy (CA)  9.1 (.17)    8.9 (.24)      8.3 (.17)    8.9 (.15)
    Retro Speed (RS)          8.2 (0.3)    6.9 (0.6)      4.7 (0.6)    4.3 (0.5)
    Retro Accuracy (RA)       8.8 (0.3)    8.4 (0.4)      7.1 (0.5)    6.8 (0.5)
    Retro Usability (RU)      7.7 (0.3)    6.9 (0.4)      4.9 (0.4)    4.4 (0.4)


Figure 1 shows a combined view of the three measures Actual Time (AT), Actual Accuracy (AA), and Concurrent Confidence ratings of Accuracy (CA) which are presented as percentages (bars) on the primary Y-axis. To allow for comparisons, the bars for each measure are clustered together. Time is presented as points connected by lines on the secondary Y-axis. Error bars indicate one standard error of the means.
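The error bars in both figures (and, it appears, the parenthesized values in Table 1) are standard errors of the mean: the sample standard deviation divided by the square root of the sample size. A minimal sketch of that computation:

```python
import math

def sem(sample):
    """Standard error of the mean.

    Uses the sample standard deviation (n - 1 denominator) divided by
    the square root of the sample size.
    """
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return math.sqrt(var) / math.sqrt(n)
```

An error bar of one SEM above and below each mean gives a rough visual cue of how precisely that mean is estimated.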


Figure 1. Actual Time (AT), Actual Accuracy (AA), and Confidence ratings of Accuracy (CA) for the four major conditions

The subjective (retrospective) results are summarized in Figure 2. The graph shows a combined view of all three ratings. The ratings for Retrospective Speed and Accuracy are presented (bars) on the primary Y-axis. Retrospective Usability is presented as points connected by a line on the secondary Y-axis. Error bars indicate one standard error of the means.


Figure 2. Subjective ratings: Retrospective Speed (RS), Accuracy (RA), and Usability (RU) of the four major conditions


Discussion

Designer-intuition predicted time did not correlate well with actual time. Actual time correlated well with concurrent confidence ratings, which might tempt one to generalize that subjective confidence ratings of accuracy are good predictors of performance time; one should be less eager, however, given that actual accuracy did not differ across conditions. The use of isolated measures becomes even more tenuous when we look at the retrospective ratings, which all correlated well with predicted time; if only subjective retrospective ratings had been collected, one might conclude that the designer's intuition was perfect. But the retrospective speed and accuracy ratings seem not to reflect the poor performance (in actual time and concurrent confidence ratings) of the Linked-Paper condition (all conditions were counterbalanced in a Latin square).
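The Latin-square counterbalancing mentioned above ensures that each condition appears once in each presentation position across subject groups, so order effects do not accumulate on any one condition. A minimal sketch using a cyclic square (the paper does not give the actual orderings used):

```python
def latin_square(conditions):
    """Cyclic Latin square: row i is the condition list rotated left by i.

    Each condition appears exactly once per row (each group sees every
    condition) and once per column (each condition occupies every
    presentation position exactly once across groups).
    """
    n = len(conditions)
    return [[conditions[(i + j) % n] for j in range(n)] for i in range(n)]

# Hypothetical assignment of the four conditions to four subject groups.
orders = latin_square(["LO", "UO", "LP", "UP"])
```

A cyclic square balances position but not immediate carryover effects; balanced or digram-balanced squares address that further, at the cost of more groups.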

It is instructive to consider the conclusions that might be drawn if we had measured fewer dependent variables.


Conclusions

We found retrospective ratings of accuracy, speed, and usability to be less related to objective measures than concurrent confidence ratings. If objective and subjective measures are inconsistent, then we would anticipate needing to explain the benefits of the more effective design. The more they are at odds, the more we might want to find ways to make the most effective design the most positively received.


Summary

In summary, the gathering of retrospective usability ratings in this experiment helped to demonstrate that such ratings may not serve well as measures of true performance and, had they been collected as the only dependent measure, could have been used to confirm incorrect predictions. On the other hand, if retrospective ratings are to be used to measure the overall impression that users take away from an experience, then uncorrelated actual data (e.g., the objective time measure) may be less useful in predicting future purchase/use behavior (see [Davis, 1989]).


References

  1. [Bailey, 1993] R.W. Bailey (1993) Performance vs. Preference. In Proceedings of the Human Factors and Ergonomics Society 37th Annual Meeting, v.1, pages 282-286.
  2. [Davis, 1989] F.D. Davis (1989) Perceived Usefulness, Perceived Ease of Use, and User Acceptance of Information Technology. MIS Quarterly, 13:3, pages 319-340.
  3. [Raghavan, 1998] S. Raghavan (1998) Empirical Evaluation of the Usefulness of Model-Based Relationship Links in Seeking Committee Information. PhD Thesis, The Ohio State University, USA.