Mark Your Calendar!
Learning Personalized Annotation from Integrated Sketch and Speech

Robin L. Kullberg

Visible Language Workshop, MIT Media Laboratory
20 Ames St, E15-443, Cambridge, MA 02139
robin@media.mit.edu

Abstract

An intelligent agent learns the user's personal sketch annotations by gathering, integrating, and interpreting sketch and speech input. This agent-assisted, multi-modal interaction affords a natural and adaptable approach to graphical annotation of a personal datebook.

Keywords:

multi-modal interface, sketch input, speech input, interaction design, intelligent learning agent

Fig. 1. Calendar graphical interface. The user has sketched an annotation to mark an appointment as important.

Introduction

This project explores an alternative method of interaction based on a multi-modal approach augmented by an intelligent interface agent. Research attempting to capture the richness and subtlety of human communication has shown that a multi-modal approach of combining speech with three-dimensional gesture can be successful in several domains [2][4].

This project combines speech with two-dimensional gesturing through hand-drawn marks. Sketching is an intuitive and ubiquitous way in which we communicate ideas -- sketch makes the mind's creative process visible to the eye and feeds the designer's discursive process [1]. When annotating text in the age of the word processor, many users (the author included) still waste paper by printing out rough drafts simply to gain the benefits of personal sketch annotation.

My approach differs from previous multi-modal approaches to computer-human interaction in that it incorporates an intelligent agent that watches, listens, and learns the user's individual sketch notation. The use of a learning agent offers the advantages of supporting differing user interaction preferences in terms of the amount of speech input desired and the types of gestural annotations used. When annotating visual material, most people use a set of symbols common in our culture in a personal notation which can be easily understood with a little bit of explanation or contextual information. Although often the symbols used are similar, not all symbols have the same meaning for all people [3]. Furthermore, one symbol can have many different meanings to the same person. Context can be used to disambiguate between different potential meanings.

An electronic calendar was chosen as an application domain because many people maintain a datebook and use their own personal symbolic notation to block out times or indicate special appointments, etc. Pages from several datebooks were collected and studied as examples of how people mark their personal calendars.

My electronic calendar affords a natural and adaptable approach to graphical annotation of a personal datebook. The sketch and speech input are gathered, integrated, and interpreted by a learning agent. The agent learns what the user signifies by her own personal gestural notations by examining the accompanying speech and the context in which the mark was made. The interaction is designed to resemble the way one person might explain to another person, as she is sketching, what her sketches signify.

Fig. 2. "Reschedule this here."
The user sketches while verbally telling the agent
the meaning of the sketch. The agent learns that
an arrow signifies the movement of an appointment
to a new timeslot.

Implementation

The graphical user interface of the calendar is shown in Fig. 1. The user can use a pen or a mouse to sketch on the full week that is displayed across the bottom half of the screen, in order to add, delete, move, or highlight an appointment. The interface agent recognizes the user's sketch [5] and translates it into an intended annotation by examining the accompanying speech [6] and the context in which the mark was made. The annotation is then interpreted graphically using color and text. Within a few seconds, the sketch fades away.

In one possible scenario, the user wants to reschedule a meeting (Fig. 2). As the user sketches an arrow from the meeting's current time to its desired time, she might say "Reschedule this here." The interface agent, which has knowledge of how an arrow looks, how to interpret it by breaking down the sketch into an origin and a destination, and what it signifies based on its accompanying speech, moves the meeting to its new timeslot.
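The decomposition of an arrow into an origin and a destination can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the grid geometry (DAY_WIDTH, HOUR_HEIGHT, FIRST_HOUR), the representation of a stroke as a list of (x, y) points drawn from tail to head, and the dictionary used as a calendar. None of these names come from the implementation described in this paper.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Slot = Tuple[str, int]  # (day name, starting hour)

# Hypothetical geometry of the week view (not from the paper).
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
DAY_WIDTH = 100.0    # pixels per day column
HOUR_HEIGHT = 40.0   # pixels per one-hour row
FIRST_HOUR = 8       # hour shown in the top row


def slot_at(point: Point) -> Slot:
    """Map a screen point to the (day, hour) timeslot that contains it."""
    x, y = point
    column = max(0, min(int(x // DAY_WIDTH), len(DAYS) - 1))
    row = max(0, int(y // HOUR_HEIGHT))
    return DAYS[column], FIRST_HOUR + row


def interpret_arrow(stroke: List[Point]) -> Tuple[Slot, Slot]:
    """Break an arrow stroke into its origin and destination timeslots.

    Assumes the stroke is drawn from tail to head, so the first point
    marks the origin and the last point marks the destination.
    """
    return slot_at(stroke[0]), slot_at(stroke[-1])


def move_appointment(calendar: Dict[Slot, str], stroke: List[Point]) -> bool:
    """Move the appointment under the arrow's tail to the slot under its head.

    Returns False when there is no appointment at the origin.
    """
    origin, destination = interpret_arrow(stroke)
    if origin not in calendar:
        return False
    calendar[destination] = calendar.pop(origin)
    return True
```

With this geometry, an arrow drawn from (150, 90) to (350, 250) would carry an appointment from Tuesday at 10 to Thursday at 14.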

When a sketch is received by the agent, the agent uses a reinforcement learning technique known as Q-learning [7] to select an appropriate action. The learning matrix consists of 12 possible types of sketches, or situations, matched against five possible actions. If there is no correlative speech accompanying the sketch, the action selected by consulting the learning matrix is executed, the resulting situation is assessed, and reinforcement occurs according to the Q-learning technique. If there is correlative speech, the agent compares the action indicated by the speech with the action chosen by consulting the learning matrix. If the actions indicated by the speech and by the sketch are the same, the learning matrix is reinforced positively and the action is carried out. If the actions are different, the action chosen by the speech is carried out and the action in the learning matrix is reinforced negatively by the same amount. If the action is evaluated by the agent as being pointless in its particular context (e.g., deleting an appointment that does not exist), negative reinforcement occurs. A resulting learning curve (Fig. 3) indicates rapid learning by the agent.
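The selection and reinforcement rules above can be sketched roughly as follows. This is a simplified illustration and not the implementation used in this project: it keeps only the immediate reward and drops the future-state term of full Q-learning, so each annotation is treated as an independent one-step decision. The learning rate ALPHA, the reward magnitude REWARD, the action labels (the four named earlier plus a placeholder "other"), and all function names are assumptions.

```python
from typing import List, Optional

N_SKETCH_TYPES = 12  # situations: the recognized sketch types
# Action labels are assumed; the text specifies only that there are five.
ACTIONS = ["add", "delete", "move", "highlight", "other"]

ALPHA = 0.5    # assumed learning rate
REWARD = 1.0   # assumed reinforcement magnitude (same size for + and -)

QTable = List[List[float]]


def new_table() -> QTable:
    """Create the 12 x 5 learning matrix with every value at zero."""
    return [[0.0] * len(ACTIONS) for _ in range(N_SKETCH_TYPES)]


def choose_action(q: QTable, sketch_type: int) -> int:
    """Pick the highest-valued action for this sketch type (first on ties)."""
    row = q[sketch_type]
    return max(range(len(row)), key=row.__getitem__)


def reinforce(q: QTable, sketch_type: int, action: int, reward: float) -> None:
    """One-step update: move the stored value toward the received reward."""
    q[sketch_type][action] += ALPHA * (reward - q[sketch_type][action])


def handle_annotation(q: QTable, sketch_type: int,
                      spoken_action: Optional[int] = None,
                      pointless: bool = False) -> int:
    """Return the action to carry out and update the learning matrix."""
    guess = choose_action(q, sketch_type)
    if spoken_action is None:
        # No speech: execute the guess; penalize it if it proved pointless.
        if pointless:
            reinforce(q, sketch_type, guess, -REWARD)
        return guess
    if spoken_action == guess:
        # Speech agrees with the matrix: positive reinforcement.
        reinforce(q, sketch_type, guess, +REWARD)
    else:
        # Speech disagrees: penalize the guess, carry out the spoken action.
        reinforce(q, sketch_type, guess, -REWARD)
    return spoken_action
```

Under these assumptions, a user who draws sketch type 3 and says "move" each time would see two wrong guesses penalized before the matrix settles on "move"; after that, the same sketch drawn without speech would select "move" on its own.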

In addition to providing rapid learning, the Q-learning technique also provides flexibility. A user may easily change or move between different sets of sketch annotations by simply verbally telling her agent what they signify.

Fig. 3. Agent learning curve.
The x-axis shows the number of annotations made
by the user, while the y-axis shows the percentage
of correct guesses made by the agent. After only
10 annotations the agent's hit rate jumps to 60%,
showing rapid learning.

Fig. 4. Examples of possible sketch annotations.

Conclusion

This experiment successfully demonstrates the use of an intelligent interface agent that learns a user's personal sketch annotation using integrated sketch and speech input. Implementation of a Q-learning reinforcement technique is shown to provide rapid and flexible learning. Informal experiments among my colleagues support my hypothesis that multi-modal interaction combining speech and sketch is natural and intuitive for the user; however, future work should include user testing and evaluation of the speech-dominant approach to learning personal sketch annotation.

Acknowledgments

This research is being conducted at the Visible Language Workshop of the MIT Media Laboratory. I would like to thank my colleagues for their support and inspiration.

This work was sponsored in part by ARPA, NYNEX, Alenia, and JNIDS. Software used in this implementation included Dean Rubine's Single-Stroke Gesture Recognition software and IBM Continuous Speech Software.

References

[1] Arnheim, Rudolf. Sketching and the Psychology of Design. Design Issues: Vol. IX, Number 2, Spring 1993, pp. 15-19.

[2] Bolt, Richard A. Put-That-There: Voice and Gesture at the Graphics Interface. Proceedings of SIGGRAPH 1980, ACM/SIGGRAPH, NY, 1980, pp. 262-270.

[3] Gross, M. D. Indexing visual databases of designs with diagrams. Proceedings of Conference on Design Decision Support Systems II Symposium on Visual Databases in Architecture, Vaals, Netherlands, 1994.

[4] Koons, David B. and Carlton J. Sparrell. ICONIC: Speech and Depictive Gestures at the Human-Machine Interface. Proceedings of CHI94, ACM, Boston, MA, 1994.

[5] Rubine, Dean. The Automatic Recognition of Gestures. PhD Dissertation, Carnegie-Mellon University, 1991.

[6] Schmandt, Christopher. Voice Communication with Computers: Conversational Systems. Van Nostrand Reinhold: New York, 1994.

[7] Sutton, Richard S. "Reinforcement Learning Architecture for Animats," Animals to Animats, edited by Jean-Arcady Meyer and Stewart W. Wilson. 1991.
