Speaker Segmentation for Browsing Recorded Audio

Donald G. Kimber, Lynn D. Wilcox, Francine R. Chen, and Thomas P. Moran

Xerox Palo Alto Research Center
3333 Coyote Hill Road, Palo Alto, California 94304

(kimber,wilcox,fchen,moran)@parc.xerox.com

Keywords:

Multi-media, Auditory I/O, Speaker Segmentation, Speaker Identification, Audio Indexing, Browsing.

Abstract

Audio recording is an easy way to capture the content of meetings, group discussions, or conversations. However, the sequential nature of the medium makes good indexing essential to the effective use of the recorded audio. One kind of index is speaker identity. We describe a system that automatically divides a multi-speaker recording into speaker segments and displays this information graphically. The tool allows a user to easily access the parts of a recording where given people are talking.

Introduction

It is difficult to find specific information in audio recordings because they must be listened to sequentially. Techniques exist for time compression of audio without pitch distortion [Arons92], but speech is incomprehensible if played faster than about twice real time. Although it is possible to fast forward or skip around, it is difficult to know exactly where to stop and listen. For this reason, effective audio browsing requires the use of indices providing some structure to the recording. Arons [Arons93] has developed a system that allows skimming of speech by playing only sections following long pauses. Wilcox et al. [Wilcox92] use keyword spotting as a means of indexing speech based on its content. Chen and Withgott [Chen92] identify emphatic regions in speech. Hindus et al. [Hindus93] use speaker change, as indicated by separate telephone and office microphones, to index and display phone conversations.

In this paper we describe speaker-based indexing derived from automatic speaker segmentation, for cases where multiple speakers are recorded on a single audio track. While standard speaker identification assumes that only a single speaker is present in an audio track, the task of speaker segmentation is to segment the audio into intervals, each containing only speech from a single speaker. These intervals can be displayed graphically as a navigational aid in browsing. Furthermore, indices based on these intervals provide the capability to skip between speakers when reviewing audio data, to play back only what a particular speaker says, etc. Pauses or silence intervals, as well as non-speech sounds such as laughter, can also be segmented for use in indexing.

SPEAKER SEGMENTATION

Segmentation of the audio according to speaker is performed using a network of hidden Markov models (HMMs) [Wilcox94]. Each speaker is modeled using an HMM consisting of states corresponding to the acoustic patterns produced by the speaker; no phonetic knowledge is used. The conversation is modeled by a parallel network of speaker HMMs, in which any speaker may follow any other. Speaker segmentation is performed using the Viterbi algorithm [Rabiner93] to find the most likely sequence of states, and noting the times when the optimal state sequence changes between speaker HMMs. In addition to modeling speakers, HMMs can also be used to model silence and non-speech sounds such as laughter.
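The decoding step can be illustrated with a minimal sketch. It is not the system described here: as a simplifying assumption, each speaker is reduced to a single state with a one-dimensional Gaussian emission, whereas the speaker HMMs above have several states and operate on acoustic feature vectors. The names viterbi_speakers, to_segments, and switch_logprob are hypothetical. The sketch shows only the idea of finding the best path through a parallel network and reading speaker changes off that path.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log-density of a 1-D Gaussian (stand-in for a speaker's emission model)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi_speakers(frames, models, switch_logprob=-5.0):
    """Most likely speaker label per frame.

    frames: list of scalar features (assumption: real features are vectors).
    models: dict mapping speaker name -> (mean, variance).
    switch_logprob: hypothetical log-score penalty for changing speaker.
    """
    speakers = list(models)
    # score[s]: best log-score of any path that ends in speaker s at the current frame
    score = {s: gaussian_logpdf(frames[0], *models[s]) for s in speakers}
    backpointers = []
    for x in frames[1:]:
        new_score, pointer = {}, {}
        for s in speakers:
            # Parallel network: any speaker may follow any other, at a penalty.
            def arrive(p):
                return score[p] + (0.0 if p == s else switch_logprob)
            best_prev = max(speakers, key=arrive)
            pointer[s] = best_prev
            new_score[s] = arrive(best_prev) + gaussian_logpdf(x, *models[s])
        score = new_score
        backpointers.append(pointer)
    # Trace the best path backwards from the best final state.
    state = max(speakers, key=lambda s: score[s])
    labels = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        labels.append(state)
    labels.reverse()
    return labels

def to_segments(labels):
    """Collapse per-frame labels into (start_frame, end_frame, speaker) intervals."""
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((start, i, labels[start]))
            start = i
    return segments
```

Frame indices convert to times by multiplying by the frame duration. The switch penalty keeps a single atypical frame from producing a spurious speaker change.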

In situations where the speakers are known a priori, and where it is possible to obtain sample data from their speech, segmentation of the audio into regions corresponding to the known speakers can be performed in real time, as the speech is being recorded. This is done by pre-training the speaker HMMs using the sample data, and then using a real-time Viterbi algorithm for segmentation. Pre-training uses the Baum-Welch training algorithm [Rabiner93] to estimate the parameters of each speaker HMM from about a minute of speech data from that speaker.

When no prior knowledge of the speakers is available, unsupervised speaker segmentation is possible using a non-real-time, iterative algorithm. Parameters for the speaker HMMs are first initialized using an agglomerative clustering procedure [Gish91] [Wilcox94], and are then iteratively improved by using the Viterbi algorithm to compute a segmentation and retraining the speaker HMMs on that segmentation. The segmentation that results after convergence of the HMM parameters is then used for the speaker index.
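The alternation between segmenting and retraining can be sketched as follows. This is an illustration under strong assumptions, not the algorithm of [Wilcox94]: each speaker is a single 1-D Gaussian rather than a multi-state HMM, retraining is a plain mean and variance estimate rather than HMM training, and the initial labels are taken as given (standing in for the output of the clustering step, which is not shown). All function names and the switch_logprob parameter are hypothetical.

```python
import math

def logpdf(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def fit_gaussians(frames, labels):
    """Re-estimate (mean, variance) per speaker from the frames assigned to it."""
    models = {}
    for spk in sorted(set(labels)):
        xs = [x for x, lab in zip(frames, labels) if lab == spk]
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        models[spk] = (mean, max(var, 1e-3))  # floor avoids a zero variance
    return models

def decode(frames, models, switch_logprob):
    """Viterbi decoding over one-state speaker models with a switch penalty."""
    speakers = list(models)
    score = {s: logpdf(frames[0], *models[s]) for s in speakers}
    backpointers = []
    for x in frames[1:]:
        new_score, pointer = {}, {}
        for s in speakers:
            def arrive(p):
                return score[p] + (0.0 if p == s else switch_logprob)
            best_prev = max(speakers, key=arrive)
            pointer[s] = best_prev
            new_score[s] = arrive(best_prev) + logpdf(x, *models[s])
        score = new_score
        backpointers.append(pointer)
    state = max(speakers, key=lambda s: score[s])
    labels = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        labels.append(state)
    labels.reverse()
    return labels

def resegment(frames, initial_labels, switch_logprob=-5.0, max_iters=20):
    """Alternate retraining and Viterbi segmentation until labels stop changing."""
    labels = list(initial_labels)
    for _ in range(max_iters):
        models = fit_gaussians(frames, labels)
        new_labels = decode(frames, models, switch_logprob)
        if new_labels == labels:  # converged: the segmentation is stable
            break
        labels = new_labels
    return labels
```

In this sketch convergence is detected when the segmentation stops changing, which is a simplification of testing convergence of the model parameters. A speaker whose frames are all reassigned simply drops out of later iterations.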

In a previous experiment [Wilcox94] using the above techniques, we reported that we could segment speakers to about 95% accuracy for a five-person SIGGRAPH panel discussion, which had a professional audio setup. Recently, we have found the automatic segmentation to be effective on recorded meetings involving 5 to 10 people, where there was a more casual recording setup, including a 7-person lunchtime discussion recorded on a microcassette recorder.

GRAPHICAL BROWSING TOOL

We implemented a graphical browsing tool to give the listener access to the speaker segment indices. (See Figure 1.) The tool runs on Sun workstations, and was implemented with the Tk interface in the Python language. The browser contains two regions with a timeline representation of a recording: an overview region displaying the entire recording and (beneath it) a larger, scrollable "detail" region. The overview region contains an adjustable marker to control scrolling and zooming within the detail region. Both regions contain adjustable playpoint markers to display and set the current playback point in the recording. Navigation buttons (on the bottom left) move the playpoint to the beginning of the next or previous speaker interval.
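The behavior of the navigation buttons can be sketched as a lookup in a sorted list of interval start times. The function names are hypothetical, and the treatment of a playpoint that lies inside an interval (the "previous" button returns to the start of the current interval) is an assumption, since the description above does not specify it.

```python
import bisect

def next_start(starts, playpoint):
    """Start time of the first interval beginning after the playpoint, or None."""
    i = bisect.bisect_right(starts, playpoint)
    return starts[i] if i < len(starts) else None

def previous_start(starts, playpoint):
    """Start time of the last interval beginning before the playpoint, or None."""
    i = bisect.bisect_left(starts, playpoint)
    return starts[i - 1] if i > 0 else None

# Hypothetical interval start times in seconds, sorted in ascending order.
starts = [0.0, 12.5, 30.0, 41.2]
print(next_start(starts, 12.5))      # 30.0
print(previous_start(starts, 13.0))  # 12.5
```

Because the start times are kept sorted, each button press costs a binary search rather than a scan of the whole recording.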

FIGURE 1: Graphical Browsing Tool.

The detail region has "tracks" for each speaker or category of sound. The time segments when a particular person is talking are indicated by colored bands on the track representing that person. A mouse click on a band causes the segment it represents to be played. Tracks can be collapsed or expanded as needed. Color is used to make tracks visually distinct; it also permits speaker segments to be distinguished when different speaker tracks are collapsed into a single track. Speaker names and their associated colors are shown as buttons (lower right). Left/right mouse button clicks on a speaker button (e.g., Don in Figure 1) cause the previous/next segment for that speaker to be played.
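One plausible way to organize the data behind the tracks is sketched here. The representation (a dictionary from speaker name to a time-ordered list of segments) and the function names are assumptions made for illustration, not a description of the tool's internals. Apart from Don, the speaker names and all times are invented.

```python
from collections import defaultdict

def build_tracks(segments):
    """Group (start, end, speaker) segments into one time-ordered track per speaker."""
    tracks = defaultdict(list)
    for start, end, speaker in sorted(segments):
        tracks[speaker].append((start, end))
    return dict(tracks)

def collapse(tracks, speakers):
    """Merge several speaker tracks into a single track.

    The speaker name is kept with each segment so that a display could
    still color the bands differently after collapsing.
    """
    return sorted((start, end, spk)
                  for spk in speakers
                  for start, end in tracks[spk])

def speaker_segment_after(tracks, speaker, playpoint):
    """Next segment of the given speaker that starts after the playpoint, or None."""
    for start, end in tracks.get(speaker, []):
        if start > playpoint:
            return (start, end)
    return None

# Hypothetical segmentation result, times in seconds.
segments = [(0.0, 4.0, "Don"), (4.0, 9.5, "Lynn"),
            (9.5, 12.0, "Don"), (12.0, 20.0, "Francine")]
tracks = build_tracks(segments)
print(speaker_segment_after(tracks, "Don", 1.0))  # (9.5, 12.0)
```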

The tool also allows the user to add new segments, and to adjust or delete existing segments. This is useful for creating initial labeled segments to use for training the speaker models, and for correcting mistakes made by the automatic segmentation algorithm. Segments and tracks can be used to represent more than speaker identity. For example, while listening to a recording, a user can create a track to mark the portions of a conversation which are deemed relevant to a certain topic (cf. [Weber94]).

FUTURE WORK

Future work will concentrate on improved segmentation accuracy using multiple microphones, incorporation of additional indexing information into the audio browser, and integration with other capture, playback, and analysis tools.

Acknowledgments

We would like to thank Polle Zellweger and other colleagues at PARC for many helpful discussions. We also thank Marcia Bush and Jan Pedersen for supporting this work.

References

Arons, B. (1992). Techniques, perception, and application of time-compressed speech, Proc. AVIOS'92, 169-177.

Arons, B. (1993). SpeechSkimmer: Interactively skimming recorded speech, Proc. UIST'93, 187-196.

Wilcox, L.D., I. Smith and M.A. Bush. (1992). Wordspotting for voice editing and audio indexing, Proc. CHI'92, 655-656.

Chen, F.R. and M.M. Withgott. (1992). The use of emphasis to automatically summarize a spoken discourse, Proc. ICASSP'92, 229-232.

Hindus, D., C. Schmandt, and C. Horner. (1993). Capturing, structuring, and representing ubiquitous audio, ACM Transactions on Information Systems 11, 376-400.

Weber, K. and A. Poon. (1994). Marquee: A tool for real-time video logging, Proc. CHI'94, 58-64.

Wilcox, L.D., F.R. Chen, D. Kimber and V. Balasubramanian. (1994). Segmentation of speech using speaker identification, Proc. ICASSP'94, 161-164.

Rabiner, L.R. and B. Juang. (1993). Fundamentals of Speech Recognition. Prentice-Hall.

Gish, H., M.-H. Siu, and R. Rohlicek. (1991). Segregation of speakers for speech recognition and speaker identification, Proc. ICASSP'91, 873-876.