Abstract
An intelligent agent learns the user's personal sketch annotations by
gathering, integrating, and interpreting sketch and speech input. This
agent-assisted, multi-modal interaction affords a natural and adaptable
approach to graphical annotation of a personal datebook.
Keywords:
multi-modal interface, sketch input, speech input, interaction design, intelligent
learning agent
Fig. 1. Calendar graphical interface.
The user has sketched an annotation to mark an appointment as important.
INTRODUCTION
This project explores an alternative method of interaction based on a
multi-modal approach augmented by an intelligent interface agent.
Research attempting to capture the richness and subtlety of human
communication has shown that a multi-modal approach of combining speech
with three-dimensional gesture can be successful in several domains
[2][4].
This project combines speech with two-dimensional gesturing through
hand-drawn marks. Sketching is an intuitive and ubiquitous way in which
we communicate ideas -- sketch makes the mind's creative process visible
to the eye and feeds the designer's discursive process [1]. Even in
the age of the word processor, many users (the author included) still
waste paper by printing out rough drafts simply to gain the benefits
of personal sketch annotation.
My approach differs from previous multi-modal approaches to
computer-human interaction in that it incorporates an intelligent agent
that watches, listens, and learns the user's individual sketch notation.
A learning agent accommodates differing user preferences, both in the
amount of speech input desired and in the types of gestural
annotations used. When
annotating visual material, most people use a set of symbols common in
our culture in a personal notation which can be easily understood with a
little bit of explanation or contextual information. Although often the
symbols used are similar, not all symbols have the same meaning for all
people [3]. Furthermore, one symbol can have many different meanings to
the same person. Context can be used to disambiguate between different
potential meanings.
An electronic calendar was chosen as an application domain because many
people maintain a datebook and use their own personal symbolic notation
to block out times or indicate special appointments, etc. Pages from
several datebooks were collected and studied as examples of how people
mark their personal calendars.
My electronic calendar affords a natural and adaptable approach to
graphical annotation of a personal datebook. The sketch and speech
input are gathered, integrated, and interpreted by a learning agent.
The agent learns what the user signifies by her own personal gestural
notations by examining the accompanying speech and the context in which
the mark was made. The interaction is designed to resemble the way one
person might explain to another person, as she is sketching, what her
sketches signify.
Fig. 2. "Reschedule this here."
The user sketches while verbally telling the agent
the meaning of the sketch. The agent learns that
an arrow signifies the movement of an appointment
to a new timeslot.
IMPLEMENTATION
The graphical user interface of the calendar is shown in Fig. 1. The
user can use a pen or a mouse to sketch on the full week that is
displayed across the bottom half of the screen, in order to add, delete,
move or highlight an appointment. The interface agent recognizes the
user's sketch [5] and translates it into an intended annotation by
examining the accompanying speech [6] and the context in which the mark
was made. The annotation is then interpreted graphically using color
and text. Within a few seconds, the sketch fades away.
In one possible scenario, the user wants to reschedule a meeting (Fig.
2). As the user sketches an arrow from the meeting's current time to
its desired time, she might say "Reschedule this here." The interface
agent, which has knowledge of how an arrow looks, how to interpret it by
breaking down the sketch into an origin and a destination, and what it
signifies based on its accompanying speech, moves the meeting to its new
timeslot.
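
As a minimal sketch of the arrow decomposition described above, the
following Python fragment treats a stroke as a list of (x, y) points,
takes the first point as the origin and the last as the destination,
and maps each onto a calendar slot. The grid geometry (DAY_WIDTH,
HOUR_HEIGHT, FIRST_HOUR), the dictionary-based calendar, and all
function names are assumptions made for illustration; they are not
taken from the actual implementation, which relies on the gesture
recognizer [5].

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Slot = Tuple[int, int]  # (day index 0-6, hour of day)

# Hypothetical grid geometry for the week view (assumed values).
DAY_WIDTH = 100.0    # pixels per day column
HOUR_HEIGHT = 40.0   # pixels per hourly row
FIRST_HOUR = 8       # hour shown in the top row


def point_to_slot(p: Point) -> Slot:
    """Map a screen point to the (day, hour) cell that contains it."""
    day = int(p[0] // DAY_WIDTH)
    hour = FIRST_HOUR + int(p[1] // HOUR_HEIGHT)
    return day, hour


def interpret_arrow(stroke: List[Point]) -> Tuple[Slot, Slot]:
    """Break an arrow stroke into an origin slot and a destination slot."""
    return point_to_slot(stroke[0]), point_to_slot(stroke[-1])


def reschedule(calendar: Dict[Slot, str], stroke: List[Point]) -> bool:
    """Move the appointment under the arrow's tail to the slot under its head.

    Returns False, leaving the calendar unchanged, when there is nothing
    to move or the destination slot is already occupied.
    """
    origin, destination = interpret_arrow(stroke)
    if origin not in calendar or destination in calendar:
        return False
    calendar[destination] = calendar.pop(origin)
    return True
```

Under these assumptions, a stroke that starts inside the Monday 9:00
cell and ends inside the Wednesday 11:00 cell moves whatever
appointment occupies the first cell into the second.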
When a sketch is received by the agent, the agent uses a reinforcement
learning technique known as Q-learning [7] to select an appropriate
action. The learning matrix consists of 12 possible types of sketches,
or situations, matched against five possible actions. If there is no
correlative speech accompanying the sketch, the action selected by
consulting the learning matrix is executed, the resulting situation is
assessed, and reinforcement occurs according to the Q-learning
technique. If there is correlative speech, the agent compares the
action indicated by the speech with the action chosen by consulting the
learning matrix. If the actions indicated by the speech and by the
sketch are the same, the learning matrix is reinforced positively and
the action is carried out. If the actions are different, the action
chosen by the speech is carried out and the action in the learning
matrix is reinforced negatively by the same amount. If the action is
evaluated by the agent as being pointless in its particular context
(e.g., deleting an appointment that does not exist), negative
reinforcement occurs. The resulting learning curve (Fig. 3) indicates
rapid learning by the agent.
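
The selection and reinforcement rules above can be sketched roughly as
follows. This is an illustrative Python outline, not the actual
implementation: the learning rate, discount factor, reward magnitude,
greedy tie-breaking, and the choice to reward an unspoken action that
is not pointless are all assumptions, and actions are represented only
by index because the five actions are not enumerated in the text.

```python
from typing import List, Optional

N_SITUATIONS = 12  # sketch types (from the text)
N_ACTIONS = 5      # possible actions (from the text)


class SketchAgent:
    """Tabular Q-learning over sketch types, with speech as a teacher."""

    def __init__(self, alpha: float = 0.5, gamma: float = 0.0,
                 reward: float = 1.0) -> None:
        # alpha, gamma, and reward are assumed values for illustration.
        self.alpha = alpha
        self.gamma = gamma
        self.reward = reward
        self.q: List[List[float]] = [
            [0.0] * N_ACTIONS for _ in range(N_SITUATIONS)
        ]

    def choose(self, situation: int) -> int:
        """Greedy choice; ties go to the lowest action index."""
        row = self.q[situation]
        return row.index(max(row))

    def update(self, situation: int, action: int, r: float,
               next_situation: Optional[int] = None) -> None:
        """Standard Q-learning update toward r + gamma * max Q(s', .)."""
        future = 0.0 if next_situation is None else max(self.q[next_situation])
        target = r + self.gamma * future
        self.q[situation][action] += self.alpha * (
            target - self.q[situation][action]
        )

    def handle(self, situation: int, speech_action: Optional[int] = None,
               pointless: bool = False,
               next_situation: Optional[int] = None) -> int:
        """Return the action to execute and reinforce the matrix."""
        guess = self.choose(situation)
        if speech_action is None:
            # No speech: execute the guess, then assess the outcome.
            r = -self.reward if pointless else self.reward
            self.update(situation, guess, r, next_situation)
            return guess
        # Speech present: it decides the action; the guess is rewarded
        # if it agreed and penalized by the same amount if it did not.
        r = self.reward if speech_action == guess else -self.reward
        self.update(situation, guess, r, next_situation)
        return speech_action
```

With these settings, wrong guesses for a given sketch type are
eliminated one at a time, so the agent needs at most four spoken
corrections before its own guess for that sketch type matches the
user's intent.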
In addition to providing rapid learning, the Q-learning technique also
provides flexibility. A user may easily change or move between
different sets of sketch annotations by simply verbally telling her
agent what they signify.
Fig. 3. Agent learning curve.
The x-axis shows the number of annotations made
by the user, while the y-axis shows the percentage
of correct guesses made by the agent. After only
10 annotations the agent's hit rate jumps to 60%,
showing rapid learning.
Fig. 4. Examples of possible sketch annotations.
CONCLUSION
This experiment successfully demonstrates the use of an intelligent
interface agent that learns a user's personal sketch annotation using
integrated sketch and speech input. Implementation of a Q-learning
reinforcement technique is shown to provide rapid and flexible learning.
Informal experiments among my colleagues support my hypothesis that
multi-modal interaction combining speech and sketch is natural and
intuitive for the user; however, future work should include user testing
and evaluation of the speech-dominant approach to learning personal
sketch annotation.
Acknowledgments
This research is being conducted at the Visible Language Workshop of the
MIT Media Laboratory. I would like to thank my colleagues for their
support and inspiration.
This work was sponsored in part by ARPA, NYNEX, Alenia, and JNIDS.
Software used in this implementation included Dean Rubine's
Single-Stroke Gesture Recognition software and IBM Continuous Speech
Software.
References
[1] Arnheim, Rudolf. Sketching and the Psychology of Design. Design
Issues: Vol. IX, Number 2, Spring 1993, pp. 15-19.
[2] Bolt, Richard A. Put-That-There: Voice and Gesture at the Graphics
Interface. Proceedings of SIGGRAPH 1980, ACM/SIGGRAPH, NY, 1980, pp.
262-270.
[3] Gross, M. D. Indexing visual databases of designs with diagrams.
Proceedings of Conference on Design Decision Support Systems II
Symposium on Visual Databases in Architecture, Vaals, Netherlands, 1994.
[4] Koons, David B. and Carlton J. Sparrell. ICONIC: Speech and
Depictive Gestures at the Human-Machine Interface. Proceedings of
CHI94, ACM, Boston, MA, 1994.
[5] Rubine, Dean. The Automatic Recognition of Gestures. PhD
Dissertation, Carnegie Mellon University, 1991.
[6] Schmandt, Christopher. Voice Communication with Computers:
Conversational Systems. Van Nostrand Reinhold: New York, 1994.
[7] Sutton, Richard S. "Reinforcement Learning Architectures for
Animats," From Animals to Animats, edited by Jean-Arcady Meyer and
Stewart W. Wilson. 1991.