



Donald G. Kimber, Lynn D. Wilcox, Francine R. Chen, and Thomas P. Moran
Abstract
Audio recording is an easy way to capture the content of meetings,
group discussions, or conversations. However, the sequential nature of
the medium makes good indexing essential to the effective use of the
recorded audio. One kind of index is speaker identity. We describe a
system that automatically divides a multi-speaker recording into
speaker segments and displays this information graphically. The tool
allows a user to easily access the parts of a recording where given
people are talking.
Introduction
It is difficult to find specific information in audio recordings because
it is necessary to listen sequentially.
Techniques exist for time compression of audio without
pitch distortion [Arons92], but speech is incomprehensible if
played faster than about twice real time. Although it is possible to fast
forward or skip around, it is difficult to know exactly where to stop
and listen. For this reason, effective audio browsing requires the
use of indices providing some structure to the recording.
Arons [Arons93] has developed a system that allows skimming of
speech by playing only sections following long pauses. Wilcox and
Bush [Wilcox92] use keyword spotting as a means of indexing
speech based on its content. Chen and Withgott [Chen92] identify
emphatic regions in speech. Hindus et al. [Hindus93]
use speaker change, as indicated by separate telephone and office
microphones, to index and display phone conversations.
In this paper we describe speaker-based indexing derived from
automatic speaker segmentation, for cases where multiple speakers are
recorded on a single audio track. While standard speaker
identification assumes that only a single speaker is present in an
audio track, the task of speaker segmentation is to segment the audio
into intervals, each containing only speech from a single speaker.
These intervals can be displayed graphically as a navigational aid in
browsing. Furthermore, indices based on these intervals provide the
capability to skip between speakers when reviewing audio data, to
play back only what a particular speaker says, etc. Pauses or silence
intervals, as well as non-speech sounds such as laughter, can also be
segmented for use in indexing.
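One way to picture such an index is as a list of labeled time intervals. The sketch below is illustrative only: the `Segment` type, the helper names `segments_for` and `talk_time`, and the sample times are assumptions made for this example, not part of the system described here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    label: str    # a speaker name, or a category such as "silence" or "laughter"
    start: float  # seconds from the start of the recording
    end: float

def segments_for(index, label):
    """Return the intervals carrying one label, in time order."""
    return sorted((s for s in index if s.label == label), key=lambda s: s.start)

def talk_time(index, label):
    """Total duration, in seconds, of all intervals carrying one label."""
    return sum(s.end - s.start for s in index if s.label == label)

# Hypothetical index for a 30-second recording with two speakers.
index = [
    Segment("A", 0.0, 12.5),
    Segment("B", 12.5, 20.0),
    Segment("silence", 20.0, 21.5),
    Segment("A", 21.5, 30.0),
]

# Playing back only what speaker A says amounts to playing these intervals.
for seg in segments_for(index, "A"):
    print(f"play {seg.start:.1f}s to {seg.end:.1f}s")
```

Keeping silence and other non-speech sounds as ordinary labels means the same lookup serves every kind of index entry.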
SPEAKER SEGMENTATION
Segmentation of the audio according to speaker is performed using a
network of hidden Markov models (HMMs) [Wilcox94]. Each speaker
is modeled using an HMM consisting of states corresponding to the
acoustic patterns produced by the speaker; no phonetic knowledge is
used. The conversation is modeled by a parallel network of
speaker HMMs, in which any speaker may follow any other. Speaker
segmentation is performed by using the Viterbi algorithm [Rabiner93]
to find the most likely sequence of states and noting the times
when the optimal state sequence changes between speaker HMMs. In
addition to modeling speakers, HMMs can also be used to model silence
and non-speech sounds such as laughter.
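The decoding step can be sketched as follows, under simplifying assumptions that do not come from the system itself: the features are one-dimensional, each speaker owns two unit-variance Gaussian states with made-up means, and the hypothetical helper `parallel_network` spreads a small switching probability over the states of the other speakers.

```python
import numpy as np

def parallel_network(state_speaker, switch_prob=0.01):
    """Log transition matrix in which any speaker may follow any other.

    Assumes at least two speakers. Each row keeps 1 - switch_prob of its
    mass inside the current speaker's states and spreads switch_prob
    evenly over the states of all other speakers.
    """
    same = state_speaker[:, None] == state_speaker[None, :]
    n_same = same.sum(axis=1, keepdims=True)
    n_other = len(state_speaker) - n_same
    trans = np.where(same, (1.0 - switch_prob) / n_same, switch_prob / n_other)
    return np.log(trans)

def viterbi(loglik, log_trans, log_init):
    """Most likely state sequence given frame log-likelihoods loglik[t, s]."""
    T, S = loglik.shape
    delta = log_init + loglik[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans      # scores[i, j]: state i to state j
        back[t] = np.argmax(scores, axis=0)      # best predecessor of each state
        delta = scores[back[t], np.arange(S)] + loglik[t]
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 1, 0, -1):                # trace the best path backwards
        path[t - 1] = back[t, path[t]]
    return path

def speaker_changes(path, state_speaker):
    """Frame indices where the best path moves into a different speaker's HMM."""
    speakers = state_speaker[path]
    return [t for t in range(1, len(speakers)) if speakers[t] != speakers[t - 1]]

# Toy data: states 0-1 belong to speaker 0, states 2-3 to speaker 1.
state_speaker = np.array([0, 0, 1, 1])
state_means = np.array([-2.0, -1.0, 1.0, 2.0])   # made-up state means
frames = np.concatenate([np.tile([-2.0, -1.0], 25), np.tile([1.0, 2.0], 25)])
loglik = -0.5 * (frames[:, None] - state_means[None, :]) ** 2
log_init = np.full(4, np.log(0.25))

path = viterbi(loglik, parallel_network(state_speaker), log_init)
changes = speaker_changes(path, state_speaker)
print(changes)
```

Movement between states of the same speaker is not reported as a change; only transitions that cross from one speaker's states to another's become segment boundaries.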
In situations where the speakers are known a priori, and
where it is possible to obtain sample data from their speech, segmentation
of the audio into regions corresponding to the known speakers can be
performed in real time, as the speech is being recorded. This is done
by pre-training the speaker HMMs using the sample data, and then using
a real-time Viterbi algorithm for segmentation. Pre-training is done
using the Baum-Welch training algorithm [Rabiner93] to estimate
the parameters of the speaker HMM, with about a minute of speech data
from the speaker.
When no prior knowledge of the speakers is available, unsupervised
speaker segmentation is possible using a non-real-time, iterative
algorithm. Parameters for the speaker HMMs are first initialized
using an agglomerative clustering procedure [Gish91] [Wilcox94],
and iteratively improved by using the Viterbi
algorithm to compute a segmentation, and then retraining the speaker
HMMs based on that segmentation. The segmentation that results after
convergence of the HMM parameters is then used for the speaker index.
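The loop of initializing, segmenting, and retraining can be sketched roughly as below. This is a heavily simplified illustration, not the procedure from the cited work: each speaker HMM is reduced to a single one-dimensional Gaussian state, retraining is a plain mean and variance estimate, the hypothetical `initial_clusters` merges fixed-length chunks by distance between chunk means as a stand-in for the cited clustering procedure, and an unchanged segmentation is used as the convergence test.

```python
import numpy as np

def viterbi(loglik, log_trans):
    """Most likely state sequence, assuming a uniform initial distribution."""
    T, S = loglik.shape
    delta = loglik[0].copy()
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(S)] + loglik[t]
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

def initial_clusters(x, n_speakers, chunk=10):
    """Agglomerative stand-in: repeatedly merge the two clusters with closest means."""
    clusters = [list(range(i, min(i + chunk, len(x)))) for i in range(0, len(x), chunk)]
    while len(clusters) > n_speakers:
        means = [x[c].mean() for c in clusters]
        _, i, j = min((abs(means[a] - means[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        merged = clusters.pop(j)                 # j > i, so index i stays valid
        clusters[i] = clusters[i] + merged
    labels = np.empty(len(x), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

def segment_speakers(x, n_speakers, switch_prob=0.01, max_iter=20):
    """Alternate Viterbi segmentation and retraining until the labels stop changing."""
    labels = initial_clusters(x, n_speakers)
    log_trans = np.full((n_speakers, n_speakers), np.log(switch_prob / (n_speakers - 1)))
    np.fill_diagonal(log_trans, np.log(1.0 - switch_prob))
    means, variances = np.zeros(n_speakers), np.ones(n_speakers)
    for _ in range(max_iter):
        for k in range(n_speakers):              # retrain each speaker model
            frames = x[labels == k]
            if len(frames) > 0:                  # keep old parameters if no frames remain
                means[k] = frames.mean()
                variances[k] = frames.var() + 1e-3
        loglik = -0.5 * ((x[:, None] - means) ** 2 / variances
                         + np.log(2 * np.pi * variances))
        new_labels = viterbi(loglik, log_trans)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Synthetic "recording": speaker 0, then speaker 1, then speaker 0 again.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 60),
                    rng.normal(2.0, 0.5, 40),
                    rng.normal(-2.0, 0.5, 50)])
labels = segment_speakers(x, n_speakers=2)
```

The cluster numbering is arbitrary, so the result identifies which frames share a speaker rather than who that speaker is.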
In a previous experiment [Wilcox94] using the above techniques,
we reported that we could segment speakers to about 95% accuracy for
a five-person SIGGRAPH panel discussion, which had a professional
audio setup. Recently, we have found the automatic segmentation to be
effective on recorded meetings involving 5 to 10 people, where the
recording setup was more casual, including a seven-person lunchtime
discussion recorded on a microcassette recorder.
GRAPHICAL BROWSING TOOL
We implemented a graphical browsing tool to give the listener access
to the speaker segment indices (see Figure 1). The tool runs on Sun
workstations and was implemented in the Python language using its Tk
interface. The browser contains two regions with a
timeline representation of a recording -- an overview region
displaying the entire recording and (beneath it) a larger, scrollable
"detail" region. The overview region contains an adjustable marker to
control scrolling and zooming within the detail region. Both regions
contain adjustable playpoint markers to display and set the current
playback point in the recording. Navigation buttons (on the bottom
left) move the playpoint to the beginning of the next or previous
speaker interval.
FIGURE 1: Graphical Browsing Tool.
The detail region has "tracks" for each speaker or category of
sound. The time segments when a particular person is talking are
indicated by colored bands on the track representing that person. A
mouse click on a band causes the segment it represents to be played.
Tracks can be collapsed or expanded as needed. Color is used to make
tracks visually distinct; it also permits speaker segments to be
distinguished when different speaker tracks are collapsed into a
single track. Speaker names and their associated colors are shown as
buttons (lower right). Clicking a speaker button (e.g., Don in
Figure 1) with the left or right mouse button plays the previous or
next segment for that speaker, respectively.
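The previous/next navigation described above can be sketched with a binary search over sorted segment start times. This is an assumption about one possible implementation, not the tool's actual code: the function names, the tuple layout, and the sample segments are hypothetical, and only the speaker name Don comes from Figure 1.

```python
from bisect import bisect_left, bisect_right

def next_start(starts, playpoint):
    """Start time of the first segment beginning after the playpoint, or None."""
    i = bisect_right(starts, playpoint)
    return starts[i] if i < len(starts) else None

def previous_start(starts, playpoint):
    """Start time of the last segment beginning before the playpoint, or None."""
    i = bisect_left(starts, playpoint)
    return starts[i - 1] if i > 0 else None

# Hypothetical segments as (speaker, start, end) in seconds, sorted by start time.
segments = [
    ("Don", 0.0, 12.5),
    ("Speaker B", 12.5, 20.0),
    ("Don", 20.0, 31.0),
    ("Speaker C", 31.0, 40.0),
]

all_starts = [start for _, start, _ in segments]
don_starts = [start for who, start, _ in segments if who == "Don"]

playpoint = 13.0
print(next_start(all_starts, playpoint))      # navigation button: next interval
print(previous_start(all_starts, playpoint))  # navigation button: previous interval
print(next_start(don_starts, playpoint))      # speaker button: Don's next segment
print(previous_start(don_starts, playpoint))  # speaker button: Don's previous segment
```

In this sketch, "previous" from the middle of a segment returns to that segment's own start; whether the tool behaves this way or jumps one segment further back is not stated in the text.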
The tool also allows the user to add new segments, and to adjust or
delete existing segments. This is useful for creating initial labeled
segments to use for training the speaker models, and for correcting
mistakes made by the automatic segmentation algorithm. Segments and
tracks can be used to represent more than speaker identity. For
example, while listening to a recording, a user can create a track to
mark the portions of a conversation which are deemed relevant to a
certain topic (cf. [Weber94]).
FUTURE WORK
Future work will concentrate on improved segmentation accuracy using
multiple microphones, incorporation of additional indexing information
into the audio browser, and integration with other capture, playback,
and analysis tools.
References
[Arons92] Arons, B. (1992).
Techniques, perception, and application
of time-compressed speech, Proc. AVIOS'92, 169-177.
[Arons93] Arons, B. (1993).
SpeechSkimmer: Interactively skimming
recorded speech, Proc. UIST'93, 187-196.
[Wilcox92] Wilcox, L.D., I. Smith, and M.A. Bush. (1992).
Wordspotting for voice editing and audio indexing, Proc. CHI'92,
655-656.
[Chen92] Chen, F.R. and M.M. Withgott. (1992).
The use of emphasis to automatically summarize a spoken
discourse, Proc. ICASSP'92, 229-232.
[Hindus93] Hindus, D., C. Schmandt, and C. Horner. (1993).
Capturing, structuring, and representing ubiquitous audio, ACM
Transactions on Information Systems 11, 376-400.
[Weber94] Weber, K. and A. Poon. (1994).
Marquee: A tool for real-time video logging,
Proc. CHI'94, 58-64.
[Wilcox94] Wilcox, L.D., F.R. Chen, D. Kimber, and V. Balasubramanian. (1994).
Segmentation of speech using speaker identification, Proc.
ICASSP'94, 161-164.
[Rabiner93] Rabiner, L.R. and B. Juang. (1993).
Fundamentals of Speech Recognition. Prentice-Hall.
[Gish91] Gish, H., M.-H. Siu, and R. Rohlicek. (1991).
Segregation of speakers
for speech recognition and speaker identification,
Proc. ICASSP'91, 873-876.
Acknowledgments
We would like to thank Polle Zellweger and other colleagues at PARC
for many helpful discussions. We also thank Marcia Bush and Jan Pedersen
for supporting this work.