



Will Hill, Larry Stead, Mark Rosenstein and George Furnas
Bellcore, 445 South Street, Morristown, NJ 07962-1910
lstead@bellcore.com, gwf@bellcore.com
With vast stores of multimedia events and objects to choose
from, future users of the national information infrastructure
will be overwhelmed with choices and human-computer
interface designers will be called upon to address the problem. The aim of this research is to evaluate the
power of a
particular form of virtual community to help users find
things they will like with minimal search effort.
Taking video selection as an initial test domain, the technique compares a viewer's personal ratings of
videos with
those of hundreds of others to find people with similar preferences and then recommends unseen videos that
these similar people have viewed and liked. The technique
outperforms by far a standard source of movie recommendations: nationally recognized movie critics.
The term community means "a group of people who share
characteristics and interact". The term virtual means "in
essence or effect only". Thus, by virtual community we
mean "a group of people who share characteristics and
interact in essence or effect only". In other words, people in
a Virtual Community influence each other as though they
interacted but they do not interact. Thus we ask: "Is it possible to arrange for people to share some of the
personalized
informational benefits of community involvement without
the associated communications costs?" Such costs might
include, for example, the time costs of developing a personal
relationship, costs to privacy, and costs of synchronous face-to-face communications.
We wish to contrast our idea of virtual community with two
popular themes in human interface work: virtual reality and
intelligent agents. First we draw the contrast with virtual
reality.
Popular future visions of networked computing and infrastructure marry perceptual immersion in virtual
reality to
high-bandwidth telecommunications. They seek a photorealistic and real-time "cyber-face to cyber-face"
social
environment [10]. This immersive vision expects total
involvement from participants. The result is what might be
called a virtual reality community with its central issues of
visual, auditory and temporal fidelity. By virtual community
we do not mean virtual reality community. The pitfalls of
seeking higher and higher fidelity to face-to-face communication
have been well discussed in Brothers et al. [2]. Virtual community is about attempting to realize
some of the
benefits of community without the associated communications costs.
A second popular vision of networked computing and infrastructure paints scenarios which include a large
role for
"intelligent agents". The idea is that of semi-autonomous
programs somehow endowed with intelligence great enough
to impress us with their ability to interpret our needs and
their work on our behalf. Our notion of virtual community
includes no central role for intelligent agents other than the
human participants in the virtual community.
Malone et al. [7] propose three types of information filtering
activities: cognitive, economic and social. Cognitive activities filter information based on content.
Economic filtering
activities filter information based on estimated search cost
and benefits of use. Social activities filter information based
on individual judgments of quality communicated through
personal relationships. This paper concentrates upon the
computer-assisted mediation of Malone's third type: social
filtering activities. However, a basic thesis of this work is
that personal relationships are not necessary to social filtering.
In fact, social filtering and personal relationships can be
teased apart and put back together in interesting new ways.
For instance, the communication of quality judgments can
occur through less personal, and even impersonal, relationships
as well as personal relationships. Obviously, people
want a satisfying mix of both personal and impersonal relationships.
We have been particularly interested in how social filtering
activities can be simultaneously streamlined and enriched
through the careful design of communication media. The
social relationships in which filtering of information occurs
can be streamlined by making them less personal and
enriched by making them more personal. For example, adding
or removing the communications costs of synchronous
face-to-face encounter, adding or removing anonymity, and choosing a more
personal medium such as voice or a less personal medium such
as text are all means of influencing the personal aspects of
communication. Social filtering can be simultaneously
streamlined and enriched by making some aspects of a relationship less personal while making other aspects
of the
relationship more personal.
In the realm of computer-assisted mediation of social filtering,
a few HCI experiments sparsely dot the space of possible designs. Goldberg's Tapestry system [3] is a
site-oriented email system encouraging the entry of free text
annotations with which on-site users can later filter messages. Annotations are rich in high quality
information and
their successful uses are valuable. However, despite hopes
to the contrary, the twin tasks of writing annotations to enter
filtering data and specifying queries to use filtering data
require significant user effort. Domains where the invested
efforts pay off readily are few, but they do exist. In the case
of annotations where the method of entering filtering information for the benefit of others has significant
user costs,
Grudin's question [4] "Who does the work and who gets the
benefit?" becomes noticeably relevant.
Reacting against the trend of interface designers loading
additional tasks on users in order to help them find things,
the history-enriched digital objects approach (HEDO)
[5][6][11] attempts to explore a region of the interface
design space that minimizes additional user tasks. Through
a combination of automatic interaction history and graphics,
depictions of communal history within interface objects hint
at their use while user effort is minimized. HEDO techniques
record the statistics of menu selections, the count of
spreadsheet cell recalculations and time spent reading documents (e.g., email, reports, source code) in a
line-by-line
manner, summing over sections and whole documents. Displays are simple shadings on menus, spreadsheets
and document scroll bars. Because the HEDO data are less
informative than annotations, they tend to be less useful, but
they cost less to gather and use. There is evidently a trade-off
here.
One way to think about the trade-off is to consider the two
approaches to social filtering mentioned so far as two ends
of a spectrum. On one end of the spectrum we have social filtering interfaces that expect more work from the
user and
give more value. On the other end of the spectrum we have
interfaces that expect no additional work from the user but
provide less value. Our thought is that perhaps somewhere
in the middle of this spectrum between the two end alternatives, there might lie special niches that offer
relatively
more filtering value for relatively less filtering work. Such
locations on the spectrum, if they existed, we could call
design "sweet spots".
Figure 1 depicts the spectrum and
places a "sweet spot" in the middle.
We have in mind the ideal of a community of users routinely
entering personal ratings of their interest concerning digital
objects in the simplest form possible: a single keypress or gesture. These evaluations are pooled and
analyzed automatically
in service of the community of use. Members of this community, at their pleasure, receive recommendations
of new or
unfamiliar digital objects that they are likely to find interesting.
Recommendations might, for instance, take the form of recommendation-enhanced browse-products that
tattoo symbols
of predicted interest upon object navigation and control
points. Later on, Figure 4 shows such a Mosaic Browsing
interface with recommendation-enhanced hypermedia links
and menus.
Of course the question is: does this kind of virtual community
work? The answer as we will show is "yes" for videos and
probably yes for many other forms of consumer level information items: books (categorized by author),
video games,
gaming scenarios, music, magazines and restaurants.
Concerning the use of ratings, Allen [1] reported unencouraging results on one of the first investigations
(known to us)
into personal ratings for HCI-type user-modeling. Recently,
Resnick et al. [9] have designed a social filtering architecture based upon personal ratings and demonstrated
its application to work-group filtering of Netnews. In a study of
eight users reading 8000 Netnews messages, Morita and
Shinoda [8] observed strong positive correlations between
time spent reading messages and personal interest ratings of
those messages. Their work suggests it might be possible
for time-on-task measures to stand in for ratings, further
reducing user tasks.
In the process of achieving our overall goal of making personal evaluations do significant interface work for
a virtual
community, our approach illustrates a number of supportive
community-oriented design goals:
Our design also embodies two research tactics.
In order to understand the power of recommending and evaluating choices in a virtual community, we posed
three basic
questions:
The second and third of these questions deserve further comment. The second question is straightforward,
and standard
statistical methods apply for answering it. On the third question, no standard measures have emerged as a
consensus. At
present, we consider two measures: (1) In a split-data test,
how well do item ratings predicted by the recommending/
evaluating system correlate with actual ratings submitted by
users? (2) How do users evaluate the results they see from the
algorithms? We report on these measures in the Results section.
Our method was to seed a virtual community in the Internet
and to do all the work necessary to exchange high quality recommendations among participants. People
participated (and
still participate) through an email interface at videos@bellcore.com. From October 1993 through May 1994
we collected data on how the virtual community functions, how
people like it, and how well it performs for participants.
The virtual community support provided by videos@bellcore.com
consists of a generic object-oriented database to
store and access preferences efficiently and to give out recommendations and evaluations. It is generic in the
sense that one
can construct various domains of items: videos, restaurants,
books, document pages, and places to visit. In particular, at the
time of our analysis, videos@bellcore.com included a data set
of 55,000+ ratings of 1750 movies by 291 users. It includes
recommending algorithms whose predictions improve as the
data grow, and the number of movies, users and ratings
continues to grow daily.
The database is organized as a set of interrelated instances of
object classes. The objects are:
The database contains 17 modules. A single high-level database
interface consisting of the following functions suffices to
control it in most circumstances: load-database, save-database,
add-user, erase-user, add-item, erase-item, add-ratings,
recommend-items, evaluate-items.
Internet participants send a message containing "subject: ratings"
to videos@bellcore.com. The system replies with an
alphabetical list of 500 videos for the user to evaluate on a
scale of 1-10 for the titles they have seen. Rating 1 is low and
10 is high. Users may also rate an unseen movie as "must-see"
or "not-interested" as appropriate. Surprisingly, early usability
tests showed that it was reasonable to expect self-selected
Internet users to rate movies on an alphabetical list of 500
movies. However, we do not expect this to be a feature of a
deployed system. In order to reduce item/item bias, for every
participant 250 of the 500 movies listed are selected randomly.
To increase rating hits and to gather a standard set of
data for purposes of fair comparison, for every participant the
remaining 250 titles are a fixed set of popular movies.
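The list construction described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the function name `build_rating_list`, its parameters, and the placeholder catalog are all assumptions introduced here.

```python
import random


def build_rating_list(catalog, popular_titles, n_random=250, seed=None):
    """Return the fixed popular titles plus a random draw from the rest, alphabetized."""
    rng = random.Random(seed)            # hypothetical: one seed per participant
    fixed = set(popular_titles)          # the same popular titles for every participant
    others = sorted(t for t in set(catalog) if t not in fixed)
    sampled = rng.sample(others, n_random)   # random half spreads ratings over the catalog
    return sorted(fixed | set(sampled))


catalog = [f"Title {i:04d}" for i in range(1750)]   # placeholder titles, not real data
popular = catalog[:250]                              # stand-in for the fixed popular set
rating_list = build_rating_list(catalog, popular, seed=42)
print(len(rating_list))  # 500
```

The fixed half gives every pair of participants a shared set of titles to compare on, while the random half keeps less popular titles from going unrated.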
When users return their movie ratings to videos@bellcore.com,
an EMACS client process parses the incoming message,
and passes ratings data inside a request for a recommendations-text
to the server database process. The server process
performs add-user, add-ratings and recommend-items. In the
initial phase of adding ratings for a new user, ratings are added
not only in the 1-10, "must-see" and "not-interested" categories,
but also in the "unseen" category for titles that the user
could have rated but did not. These unseen movies are the first
pool from which to compute recommendations.
When a user is new, the database first looks for correlations
between the new user's ratings and ratings from a random
subsample of known users. We use the random subsample to
limit the number of correlations computed to be O(n) rather
than O(n^2) in the number of participants. One-tenth of the new
user's ratings are held out from the analysis for later quality
testing purposes. The most similar users found are used as
variables in a multiple-regression equation to predict the new
user's ratings. The generated equation is then evaluated by
predicting the held out one-tenth of the new user's ratings and
then correlating these predictions with the actual ratings.
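The procedure in this paragraph can be sketched in a few dozen lines. The sketch below is an illustration under stated assumptions rather than the authors' implementation: the names `fit_similar_viewers`, `sample_size` and `k` are hypothetical, a neighbour's missing rating is filled with that neighbour's mean, and a tiny ridge term is added so the regression stays solvable. None of these details is specified in the text.

```python
import math
import random


def pearson(xs, ys):
    """Pearson r of two equal-length lists; 0.0 when undefined."""
    n = len(xs)
    if n < 2:
        return 0.0
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_similar_viewers(new_ratings, known, sample_size=50, k=3, seed=0):
    """Return (predict, neighbours, held_out_r) for a new user's numeric ratings."""
    rng = random.Random(seed)
    movies = sorted(new_ratings)
    rng.shuffle(movies)
    n_hold = max(1, len(movies) // 10)          # hold out one-tenth for quality testing
    held_out, train = movies[:n_hold], movies[n_hold:]

    # Correlate against a random subsample only, so the work per new user is bounded.
    sample = rng.sample(sorted(known), min(sample_size, len(known)))
    scored = []
    for user in sample:
        common = [mv for mv in train if mv in known[user]]
        r = pearson([new_ratings[mv] for mv in common],
                    [known[user][mv] for mv in common])
        scored.append((r, user))
    neighbours = [user for _, user in sorted(scored, reverse=True)[:k]]

    # Assumption: a neighbour's missing rating is replaced by that neighbour's mean.
    means = {u: sum(known[u].values()) / len(known[u]) for u in neighbours}

    def row(mv):
        return [1.0] + [known[u].get(mv, means[u]) for u in neighbours]

    # Multiple regression via the normal equations; the ridge term is an assumption
    # that keeps the system solvable when neighbours are nearly collinear.
    xs, ys = [row(mv) for mv in train], [new_ratings[mv] for mv in train]
    p = len(neighbours) + 1
    xtx = [[sum(r[i] * r[j] for r in xs) + (1e-6 if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(xs, ys)) for i in range(p)]
    beta = solve(xtx, xty)

    def predict(mv):
        return sum(b * x for b, x in zip(beta, row(mv)))

    held_out_r = pearson([predict(mv) for mv in held_out],
                         [new_ratings[mv] for mv in held_out])
    return predict, neighbours, held_out_r
```

Here `known` maps each existing user to a dictionary of movie ratings, and `held_out_r` plays the role of the quality check on the held-out tenth.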
Once the prediction equation exists, it is quite fast to evaluate
every unseen movie, sort them by highest prediction and skim
off the top to recommend. When recommended, movies are
marked in the database as "pending-as-suggestion". A recommendation
text is generated and passed back to the EMACS
front-end client process where it is mailed back to the user or
users.
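The ranking step can be illustrated with a short sketch. The function name `recommend`, the `statuses` dictionary and the `top_n` cut-off are assumptions made for this example; only the "unseen" and "pending-as-suggestion" labels come from the text.

```python
def recommend(predict, statuses, top_n=10):
    """Rank a user's unseen movies by predicted rating and mark the top ones."""
    unseen = [mv for mv, status in statuses.items() if status == "unseen"]
    ranked = sorted(unseen, key=predict, reverse=True)[:top_n]
    for mv in ranked:
        statuses[mv] = "pending-as-suggestion"   # not offered again while pending
    return [(mv, predict(mv)) for mv in ranked]


# Toy data: a numeric status is a rating, a string status is a category.
statuses = {"A": "unseen", "B": 7, "C": "unseen", "D": "unseen", "E": "not-interested"}
scores = {"A": 6.1, "C": 8.4, "D": 7.2}          # made-up predicted ratings
print(recommend(scores.get, statuses, top_n=2))  # [('C', 8.4), ('D', 7.2)]
```

Any callable that maps a title to a predicted rating can be passed as `predict`; here a dictionary lookup stands in for the regression equation.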
The Internet email interface is currently a subject-line command
interface and there are many commands for specialized
actions. Further details are available by sending mail to
videos@bellcore.com.
Here is a sample reply from the system. Names have been
changed to protect anonymity:
Suggested Videos for: John A. Jamus.
Your must-see list with predicted ratings:
The viewing patterns of 243 viewers were consulted. Patterns of 7 viewers
were found to be most similar.
Correlation with target viewer:
Suggested Videos for: Jane Robins, Jim Robins, together.
Your video categories with average ratings:
We have algorithms for two purposes, recommending items
and evaluating items. Having tried a few versions of each, we
report on the best we have discovered so far. We do not have
evidence that these are the best algorithms possible, only that
they are good. The algorithms we use for recommending have
the following abstract functional form:
The function to return an evaluation of a proposed choice
looks like this:
Currently the database consists of 291 participants in the
community, 55,000 ratings on a 1-to-10 scale, another 2100
"must-see" or "not-interested" ratings, 64,000 "unseen" and
1200 "pending-as-suggestion" ratings. Of the 1750 movies in
the database, 1306 have at least one rating and 739 have at
least 3 ratings. 208 movies have more than 100 ratings, and 2
movies have more than 200 ratings. Users rate an average of
183 movies each with a standard deviation of 99. More than
220 of 291 total participants rated more than 100 movies. The
database is small, but large enough to conservatively but
accurately estimate a number of performance parameters.
For the 739 movies that have three or more ratings, Figure 2
shows the distribution of movies by their mean rating. Notice
the slight bias toward positive ratings.
Six weeks after they first tried videos@bellcore.com
by submitting ratings and receiving
recommendations, 100 early users were asked to re-rate exactly the same
list of movie titles as they had rated the first time. 22 volunteers
replied with a second set of ratings. Three outliers were
removed from the reliability analysis since they correlated
perfectly and were evidently copies of the original ratings
rather than second independent sets of ratings. For the remaining
19 users, on movies rated on both occasions, the Pearson r
correlation between first-time and second-time ratings six
weeks apart was 0.83. This number gives a rough estimate
of how reliable a source of information the ratings are.
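A test-retest computation of this kind might look like the sketch below. It assumes that rating pairs are pooled across users before correlating, which the text does not state, and the name `retest_reliability`, the `copy_threshold` parameter and the toy ratings are all hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson r of two equal-length lists; 0.0 when undefined."""
    n = len(xs)
    if n < 2:
        return 0.0
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def retest_reliability(first, second, copy_threshold=0.9999):
    """Pooled Pearson r between two rating rounds, skipping apparent copies."""
    pooled_first, pooled_second, kept = [], [], []
    for user in sorted(set(first) & set(second)):
        common = sorted(set(first[user]) & set(second[user]))
        xs = [first[user][mv] for mv in common]
        ys = [second[user][mv] for mv in common]
        if pearson(xs, ys) >= copy_threshold:   # perfect match: likely a resubmitted copy
            continue
        pooled_first += xs
        pooled_second += ys
        kept.append(user)
    return pearson(pooled_first, pooled_second), kept


# Toy data only: u1 resubmitted identical ratings, u2 re-rated independently.
first = {"u1": {"A": 8, "B": 3, "C": 6, "D": 9}, "u2": {"A": 5, "B": 5, "C": 9, "D": 2}}
second = {"u1": {"A": 8, "B": 3, "C": 6, "D": 9}, "u2": {"A": 6, "B": 4, "C": 9, "D": 3}}
r, kept = retest_reliability(first, second)
print(kept, round(r, 2))
```

Averaging per-user correlations instead of pooling is an equally plausible reading and would give a somewhat different number on real data.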
We held out 10% of every participant's movie ratings to provide
a cross-validation test of accuracy. The cross-validated
correlation of predicted ratings and actual ratings estimates
how well our recommendation method is working. Figure 3
shows that our current best similar viewers algorithm correlates
at 0.62 with user ratings. This is a strong positive correlation
which means the recommendations are good. How good?
We may expect three out of every four recommendations will
be rated very highly by a potential viewer. We compared the
quality of our virtual community recommendation method to
a standard method of getting recommendations, that is, following
the advice of movie critics. The ratings of movies by
two nationally-known movie critics were entered. Their ratings
correlate much more weakly at only the 0.22 level with
viewer ratings. Thus the virtual community method is dramatically
more accurate, as Figure 3 also shows.
Email responses from videos@bellcore.com include a request
for open-ended feedback. Out of 51 voluntary responses, 32
were positive, 14 negative and 5 neutral. Here are some sample
quotes:
Open-ended feedback from users also indicated interest in
establishing direct social contacts within their virtual community.
Users can participate in either an anonymous or signed
fashion. Interestingly, only four users exercised the anonymity
option. Wishing to extend the social possibilities of the virtual
community, two users asked if they could set "single and
available" flags in the community indicating they wanted to
use the community as a means of dating. One user found a
long lost friend from junior high school. Another wrote that he
took the high correlation between his movie tastes and those
of someone he was dating as evidence for a long future relationship.
One of the standard uses of reliability measures is to put a
bound on prediction performance. The basic idea is that since a
person's rating is noisy (i.e., has a random component in
addition to their underlying true feeling about the movie),
it will never be possible to predict their rating perfectly. Standard
statistical theory says that the best one can do is the
square root of the observed test-retest reliability correlation.
(This is essentially because predicting what the user said once
from what they said to the same question last time has noise
at both ends, squaring its effect. The correlation with the truth,
if some technique could magically extract it, would have the
noise in only once, and hence is bounded only by the square
root of the observed reliability.) The point to note here is that
the observed reliability of 0.83 means that in theory one might
be able to get a technique that predicts preference with a correlation
of 0.91. The performance of techniques presented
here, though much better than that of existing techniques, is
still much below this ideal limit. Substantial improvements
may be possible.
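The square-root bound can be written out under the usual classical test theory assumptions, which are supplied here for illustration and are not spelled out in the text: each observed rating is a true preference t plus independent noise of equal variance on both occasions.

```latex
% Observed ratings on two occasions: true preference plus independent noise
x_1 = t + e_1, \qquad x_2 = t + e_2, \qquad
\operatorname{Var}(e_1) = \operatorname{Var}(e_2) = \sigma_e^2

% Test-retest reliability: noise enters at both ends
r_{x_1 x_2} = \frac{\sigma_t^2}{\sigma_t^2 + \sigma_e^2}

% Correlation of one rating with the noise-free truth: noise enters once
r_{x t} = \frac{\sigma_t}{\sqrt{\sigma_t^2 + \sigma_e^2}} = \sqrt{r_{x_1 x_2}}

% With the observed reliability of 0.83
r_{\max} = \sqrt{0.83} \approx 0.91
```

Against this ceiling of about 0.91, the cross-validated correlation of 0.62 reported above leaves considerable headroom.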
We see a potential for deployment to customers of national
information access who will be faced with thousands of
possible choices for information and entertainment, in addition
to videos.
We have instantiated a version of our server where items are
World Wide Web URLs (universal resource locators) in place
of videos. Figure 4 displays a modified Mosaic browser interface that accepts
ratings of WWW pages on a slider widget
(near bottom) and reports them to an appropriate virtual community
server. When a user clicks on the Recommend URL
button (near bottom), the browser contacts the virtual community
server to get recommended URLs and then fetches the
recommended page. It also displays next to every hypertext
link, one-half to four stars which represent the virtual community's
predicted value of chasing down the hypertext link.
One direction in which we plan to push the research is toward
more individual and social aspects. In particular we are interested
in distributed peer-to-peer versions rather than the centralized
client/server version that we have now. A wireless
deployment of a peer-to-peer version could include wearable
PCS devices, pairs of which will, when in close physical proximity,
exchange ratings data for local virtual community computation.
See Community and History-of-Use
Navigation Home Page for further information.
Keywords:
Human-computer interaction, interaction history, computer-supported cooperative work,
organizational computing, browsing, set-top interfaces, resource discovery,
video on demand.
Introduction
Virtual community, not virtual reality nor intelligent agents
Relation of current work to previous research
Interface Design Goals
The Research Questions
METHOD: AN INTERNET CONCEPT TRIAL
How Virtual Community Technology Works
Organization of the Database
The Email Interface
Options
Instructions are also given for exercising various options in
the community. For example, one can order up joint recommendations
for more than one person and from a particular set
of community members. This second example shows both
capabilities at once. Jane and Jim want a joint recommendation
of what movie to watch together. They also want recommendations
only from Mary and Dick rather than the
community at large. Again names have been changed.
THE COMMUNAL HISTORY-OF-USE ALGORITHMS
RESULTS
The Data
Reliability
Cross-validated Correlation Study
User Feedback
The Upper Limit
FUTURE DIRECTIONS
Virtual Community recommending in the Mosaic Interface to the World Wide Web
Community Headroom