User-Centered Video: Transmitting Video Images Based on the User's Interest

Kimiya Yamaashi, Yukihiro Kawamata, Masayuki Tani, Hidekazu Matsumoto

Hitachi Research Laboratory, Hitachi, Ltd.
7-1-1 Omika, Hitachi, Ibaraki, 319-12 Japan
Phone: +81-294-52-5111
E-mail: yamaashi@hrl.hitachi.co.jp

Abstract

Many applications, such as video conference systems and remotely controlled systems, need to transmit multiple video images through narrow band networks. However, high quality transmission of all the video images is not possible within the network bandwidth. This paper describes a technique, User-Centered Video (UCV), which transmits multiple video images through a network by changing the quality of the video images based on a user's interest. The UCV assigns a network data rate to each video image in proportion to the user's interest. The UCV transmits video images of interest with high quality, while degrading the remaining video images. The video images are degraded in the space and time domains (e.g., spatial resolution, frame rate) to fit them into the assigned data rates. The UCV evaluates the degree of the user's interest based on the window layouts. The user thereby obtains both the video images of interest, in detail, and the global context of video images, even through a narrow band network.

Keywords:

Networks or communication, Digital video, Compression, User's interest, Computing resources

INTRODUCTION

Most image compression techniques, such as H.261 [1], JPEG (Joint Photographic Experts Group) [2] and MPEG (Moving Pictures Experts Group) [3], are not user-centered. These techniques compress a video image in the same way without considering what a user is watching in the video image. Furthermore they cannot allow transmission of multiple video images through a narrow band network such as a public network like ISDN (Integrated Services Digital Network). For example, the MPEG technique allows only one video image to be transmitted through the ISDN-1500 (1.5 Mbits/second) network [3]. Since the technique degrades all the video images uniformly to fit them into the network, a user cannot obtain video images of individual desired quality.

Many applications, such as video conference systems, remotely controlled systems and security systems, rely on transmission of multiple video images through public networks. For example, users of video conference systems want to look at multiple video images showing all participants in the conference at one time. Operators of remotely controlled systems always look at multiple video images of remote spots because problems may occur anywhere. In these applications, each site is a long way from the others. In video conference systems, sites may be several hundred miles from each other. In remotely controlled systems, remote spots may be several miles from the control room. These systems need to use a public network to transmit multiple video images because it is very expensive to build a private network between such remote locations.

Users are not interested in all the video images at the same time. They watch some of the video images carefully and see the remaining video images peripherally to obtain a global context of the situation. For example, when an operator in a remotely controlled system is inspecting a machine, the operator carefully watches the video image that shows the machine. The operator sees the remaining video images peripherally, only to grasp whether problems are occurring there or not.

This paper describes a technique for transmitting multiple video images, based on a user's interest, through a network. The technique, User-Centered Video (UCV), compresses video images of interest with high quality and degrades the remaining video images according to the user's interest. That is to say, the UCV assigns a larger part of a network data rate to a video image of interest than the remaining video images. The UCV transmits multiple video images through the following steps:

1) The UCV evaluates a user's interest dynamically based on window layouts.
2) The UCV degrades video images according to the evaluated user's interest by changing the frame rate and the number of pixels of video images to fit them into the network bandwidth.

There are several approaches to control display [4-6] or image compression [7,8] according to a user's interest. For example the Generalized Fisheye Views approach enables a user to recognize a long program list by displaying the part of interest in detail, while giving the whole list in a global context [4]. These techniques control only screen space among the computing resources. The UCV assigns network data rates to multiple video images according to a user's interest. The image query system developed by Hill et al. [7] transmits a rough image of the whole image first and then transmits a detailed image of the region specified by the user. But, they discuss only transmission of a static image. The UCV transmits multiple continuous media (video images) through a network. Plompen et al. [8] proposed a video conference system that tracks movement of participants' heads. They assumed that participants in one site are interested in the movement of participants' heads in the other site. As they determined a user's interest statically, the technique cannot adapt to dynamic changes of the user's interest. The UCV evaluates the degree of the user's interest dynamically so that it can adapt to the dynamic changes.

In this paper, we first describe the concept of the UCV. Then we describe the UCV in detail: the architecture, evaluation of the degree of a user's interest, and determination of video parameters (frame rate and number of pixels). Finally we demonstrate the UCV with a prototype system.

CONCEPT OF USER-CENTERED VIDEO

A typical application of the UCV is a remotely controlled system, such as the public water system shown in FIGURE 1. The system places several video cameras around an intake tower to watch conditions (e.g., at the floodgate for the intake, on the river, at gates to the water channel, etc.). The system usually has several intake towers that are several miles from the control room, so for cost reasons it is desirable to transmit multiple images around each intake tower through a public network.

FIGURE 1 No caption given.

The MPEG technique allows only one video image at 30 frames/second through the ISDN-1500 network. So when the system transmits 8 video images, MPEG transmits all the video images at only 3-4 frames/second. It is hard for an operator to control and watch the floodgate because smooth motion of the floodgate is lacking with such a degraded video image.

The system does not need to transmit the video images with the same quality. During opening of a floodgate for intake the operator is focusing on the video image of the floodgate to examine the exact position and motion of the floodgate. The operator needs to obtain the video image of the floodgate with high quality. At the same time the operator looks at global context of the remaining video images rather than the details. For example, the operator is seeing peripherally the video images of gates that are not being handled. Because rough images of the gates allow the operator to grasp whether the gates are still or not, the operator does not need details such as the exact positions of the gates.

The user's interest changes dynamically according to the situation. For example, when a problem begins to occur in a video image, the operator shifts attention to the video image in which the trouble is occurring. The UCV transmits video images of interest with high quality by degrading the remaining video images, because it is wasteful to transmit video images of little interest with the same quality as video images of interest. The UCV evaluates the degree of the user's interest in each video image dynamically. The UCV assigns a network data rate to each video image according to the evaluated degree of the user's interest. The UCV changes the quality of each video image to fit it into the assigned data rate.

USER-CENTERED VIDEO

Architecture of UCV

A schematic diagram of the UCV is shown in FIGURE 2. Only the receiver can observe the user's conditions. In the UCV the transmitter therefore not only sends video images to the receiver, but also receives information about the user's interest from the receiver. The UCV adds two steps to the conventional video transmitting steps: (1) dynamic evaluation of the degree of the user's interest; and (2) determination of the video parameters (frame rate and number of pixels) of each video image.

FIGURE 2 No caption given.

The UCV can also use a conventional video compression technique (JPEG, MPEG or H.261) in the encode and decode steps in FIGURE 2. Combined with such a technique, the UCV uses the network bandwidth more efficiently than the conventional technique alone, because it reduces the data spent on video images of little interest.

Evaluating user's interest

The UCV evaluates a user's interest based on window layouts. For example, when a user compares two video windows, the user places these video windows close to each other to look at both at one glance. We can therefore estimate the user's interest based on the distance between windows. The degree of a user's interest decreases with the distance from a focused window as shown in FIGURE 3. When the user is focusing on a region, the user's vision is clear at the center of the view and fuzzy in the surrounding regions. This is because the retina of a human eye is hierarchically decomposed into a foveal region that perceives details and a surrounding low resolution region [5].

FIGURE 3 No caption given.

In this paper the UCV formalizes DOI (Degree Of Interest) of each video window as the following function:

DOI(x) = exp (-d)

where exp() is an exponential function and d is the distance between a window x and the focused window. We define DOI of the focused window as one (maximum). DOI of a video window is equal to 0 when the distance from the focused window is infinite.
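The DOI function can be sketched in a few lines of Python. The paper does not state how the distance d is measured or in what units, so the helper window_distance below is a hypothetical choice: it assumes the Euclidean distance between window centers, divided by an assumed scale factor.

```python
import math


def degree_of_interest(distance: float) -> float:
    """DOI(x) = exp(-d), where d is the distance from the focused window."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return math.exp(-distance)


def window_distance(center_a, center_b, scale=1.0):
    """Hypothetical distance measure (an assumption, not from the paper):
    Euclidean distance between window centers, divided by `scale`."""
    dx = center_a[0] - center_b[0]
    dy = center_a[1] - center_b[1]
    return math.hypot(dx, dy) / scale


# The focused window is at distance 0 from itself, so its DOI is 1.
focused_doi = degree_of_interest(0.0)
# A window whose center is one `scale` unit away has DOI exp(-1).
neighbor_doi = degree_of_interest(window_distance((0, 0), (3, 4), scale=5.0))
```

The choice of scale determines how quickly interest falls off across the screen; a larger scale keeps distant windows at a higher DOI.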

In the UCV a user specifies the focused window explicitly like an active window. There are many ways to detect the focused window. For example an eye tracker can detect the spot on which the user is focusing. We adopt a simple way such as the user's explicit specification because most users dislike wearing special devices such as glasses for an eye tracker.

Determining video parameters of each video image

The UCV assigns a network data rate to each video image according to the evaluated user's interest and determines video parameters of each video image to fit it into the assigned network data rate.

The UCV determines a network data rate in proportion to the evaluated user's interest. The UCV calculates the required network data rate to transmit each video image. If the sum of the required network data rate is equal to or less than the network bandwidth, the UCV assigns the required network data rate to each video image. If the sum is greater than the network bandwidth, the UCV divides the bandwidth into the network data rates for video images in proportion to the required network data rate multiplied by the DOI of each video image.
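The assignment rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name assign_data_rates and its argument layout are assumptions introduced here.

```python
def assign_data_rates(required, doi, bandwidth):
    """Split `bandwidth` among video images.

    required:  data rate each image needs at full quality
    doi:       degree of interest of each image (1.0 for the focused window)
    bandwidth: total network bandwidth, in the same unit as `required`
    """
    if len(required) != len(doi):
        raise ValueError("required and doi must have the same length")
    # Enough bandwidth for everything: every image gets what it requires.
    if sum(required) <= bandwidth:
        return [float(r) for r in required]
    # Otherwise divide the bandwidth in proportion to required rate * DOI.
    weights = [r * d for r, d in zip(required, doi)]
    total_weight = sum(weights)
    if total_weight == 0:
        return [0.0] * len(required)
    return [bandwidth * w / total_weight for w in weights]


# Two images that each require 10 units, with only 15 units available.
rates = assign_data_rates([10, 10], [1.0, 0.5], 15)
```

In this sketch the proportional share of a high-DOI image is not capped at its required rate; whether the UCV applies such a cap is not stated in the text.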

The UCV determines the frame rate and the number of pixels of each video image based on the assigned network data rate. The data amount of a video image is proportional to the product of the frame rate and the number of pixels. The assigned network data rate determines only the value of this product, so the frame rate and the number of pixels are traded off against each other. There are many ways to determine the frame rate and the number of pixels. In this paper the UCV first calculates a temporary frame rate from the assigned network data rate, assuming that the size of each video image is equal to the displayed size of the video image. If the temporary frame rate is below a minimum frame rate specified in advance, the UCV decreases the number of pixels by reducing the number of columns and rows in turn, which increases the frame rate, until the frame rate reaches the minimum frame rate.
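The trade-off between frame rate and number of pixels can be sketched as below. Several details are assumptions made for illustration: integer zoom rates, strict alternation between columns and rows starting with columns, 24 bits/pixel, and the max_zoom safety limit. The function name choose_video_parameters is hypothetical.

```python
def choose_video_parameters(rate_bps, cols, rows,
                            bits_per_pixel=24, min_fps=5.0, max_zoom=16):
    """Return (column zoom, row zoom, frame rate) for one video image.

    Starts from the displayed size (zoom 1x1) and coarsens columns and
    rows alternately until the frame rate reaches `min_fps`.
    """
    zoom_c, zoom_r = 1, 1
    shrink_columns_next = True  # assumed order: columns first, then rows
    while True:
        pixels = max(1, cols // zoom_c) * max(1, rows // zoom_r)
        fps = rate_bps / (pixels * bits_per_pixel)
        if fps >= min_fps or (zoom_c >= max_zoom and zoom_r >= max_zoom):
            return zoom_c, zoom_r, fps
        if shrink_columns_next:
            zoom_c += 1
        else:
            zoom_r += 1
        shrink_columns_next = not shrink_columns_next


# Example: 5 % of an assumed 23 Mbits/second budget for a 320x240 image.
low_interest = choose_video_parameters(0.05 * 23_000_000, 320, 240)
```

Under these assumptions, a 5 % share of 23 Mbits/second for a 320 by 240 image yields a zoom rate of 3x3 at roughly 6 frames/second, and a 16 % share yields 2x2.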

UCV IN A VIDEO IMAGE

Specification of a region of interest

The UCV transmits a video image of interest with higher quality than is possible without the UCV. When a user wants still higher quality within that video image, the UCV allows the user to obtain the regions of interest in the video image with higher quality by degrading the remaining region.

There are regions of interest and regions of no interest in a video image. For example, during opening of the floodgate, an operator is carefully watching the region of the floodgate, rather than the other places, such as the background wall. The operator wants to look at the region of the floodgate with higher quality even if the remaining region is degraded.

In the UCV a user can create, delete and move a rectangular region and change the resolution of the specified region as shown in FIGURE 4. The UCV transmits the specified region of interest with higher quality by degrading the remaining region (the background region).

FIGURE 4 No caption given.

Video parameters of each region

The UCV sets the resolution of each region to the resolution the user specified for it. From the network data rate assigned to the video image, the UCV first allocates to each region of interest the data rate required to transmit that region at its specified resolution. The UCV then degrades the background region to fit it into the data rate that remains.
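The following sketch illustrates this two-step budget for a single rectangular region of interest. It assumes one frame rate shared by the whole video image, the same zoom rate in both directions, fractional zoom rates, and 24 bits/pixel; the function name background_zoom is hypothetical.

```python
import math


def background_zoom(cols, rows, roi_cols, roi_rows, roi_zoom,
                    rate_bps, fps, bits_per_pixel=24):
    """Linear zoom rate the background needs so that the region of
    interest (ROI) and the background together fit into `rate_bps`."""
    # Pixels per frame that the assigned data rate can carry.
    pixel_budget = rate_bps / (fps * bits_per_pixel)
    # Step 1: the ROI takes what it needs at the user-specified zoom rate.
    roi_pixels_sent = (roi_cols / roi_zoom) * (roi_rows / roi_zoom)
    background_budget = pixel_budget - roi_pixels_sent
    if background_budget <= 0:
        raise ValueError("assigned data rate cannot carry the ROI alone")
    # Step 2: the background is coarsened to fit the remaining budget.
    background_area = cols * rows - roi_cols * roi_rows
    return math.sqrt(background_area / background_budget)


# A data rate that carries a 320x240 image at 3x3 zoom and 5 frames/second.
rate = (320 * 240 / 9) * 5 * 24
zoom_without_roi = background_zoom(320, 240, 0, 0, 1, rate, fps=5)
zoom_with_roi = background_zoom(320, 240, 80, 45, 1, rate, fps=5)
```

Sending an 80 by 45 region at full resolution leaves fewer pixels for the background, so zoom_with_roi comes out larger (coarser) than zoom_without_roi.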

We might try to change the frame rate to change the data amount of each region, but if the frame rates of regions are changed, seams might appear at the edge of each region because the digitizing time of each region is different. The UCV does not change the frame rate of regions.

EXAMPLES WITH A PROTOTYPE SYSTEM

Example for ISDN-1500

We developed a prototype system that simulates views of video images transmitted by the UCV. The system digitizes and displays multiple video images while changing the frame rate and the number of pixels of each video image so that the total data amount of the video images is equal to the assumed bandwidth. This prototype system can simulate views of video images through any network by changing the bandwidth.

We assumed that the prototype system's bandwidth was 23 Mbits/second, which can display a video image (320 columns and 240 rows, 24 bits/pixel) at 13 frames/second. The views of the system correspond to views of video images transmitted with JPEG through ISDN-1500 networks, because JPEG can compress a video image to 1.5-2.0 bits/pixel with quality indistinguishable from the original [2].
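The arithmetic behind these figures can be checked with a short calculation. The sketch assumes that 1 Mbit means 2**20 bits, which is not stated in the paper but reproduces the 13 frames/second figure; the function names are illustrative.

```python
MBIT = 2 ** 20  # assumption: 1 Mbit = 2**20 bits


def uncompressed_frame_rate(bandwidth_bps, cols, rows, bits_per_pixel=24):
    """Frames per second of uncompressed video that fit into the bandwidth."""
    return bandwidth_bps / (cols * rows * bits_per_pixel)


def equivalent_uncompressed_bandwidth(network_bps, compressed_bpp, raw_bpp=24):
    """Uncompressed bandwidth that carries as many pixels per second as a
    network carrying video compressed to `compressed_bpp` bits/pixel."""
    return network_bps * raw_bpp / compressed_bpp


# 23 Mbits/second carries a 320x240, 24 bits/pixel image at about 13 fps.
fps = uncompressed_frame_rate(23 * MBIT, 320, 240)

# ISDN-1500 (1.5 Mbits/second) with JPEG at 1.5 to 2.0 bits/pixel.
upper = equivalent_uncompressed_bandwidth(1.5 * MBIT, 1.5)  # 24 Mbits/second
lower = equivalent_uncompressed_bandwidth(1.5 * MBIT, 2.0)  # 18 Mbits/second
```

The assumed 23 Mbits/second lies between the two bounds, which is consistent with treating the prototype's views as JPEG-compressed video over ISDN-1500.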

COLOR PLATE 1

YAMAASHI_PLATE 1 shows a view of 4 video images (the displayed size of each is 320 columns and 240 rows) with the UCV. VIDEO SEGMENT 1 is the focused window, designated FO. The title bar of each video window shows the name, the frame rate, and the zoom rates of the columns and rows. The frame rate was kept above 5 frames/second when determining the frame rate and the number of pixels. The zoom rate is the ratio of the displayed size to the digitized size of a video image. For example, a column zoom rate of 3 means that each digitized pixel is magnified to 3 pixels in the column direction when it is displayed. Returning to our example, an operator can inspect details of a floodgate with smooth motion. Without the UCV the operator can see all video images only at the average frame rate (3 frames/second), which is too low to observe the motion of the floodgate effectively. The UCV can show the focused video image of the floodgate with about three times smoother motion (8 frames/second). The share of the network bandwidth assigned to the focused video image is 59 %.

DYNAMIC FIGURE 1: No caption given. (QuickTime Movie, about 10 mb)

The UCV degrades the remaining video images according to the distance from the focused video image. For example VIDEO SEGMENT 3 is displayed with the zoom rate of 2x2 at 8 frames/second. The assigned ratio of the network bandwidth for VIDEO SEGMENT 3 is 16 %. VIDEO SEGMENT 2, which is the farthest from the focused window, is displayed with the zoom rate of 3x3 at 6 frames/second. The assigned ratio of the network bandwidth for VIDEO SEGMENT 2 is only 5 %. The operator can obtain the global motion of video images even when they are degraded. For example, VIDEO SEGMENT 2 shows carp in the water quality monitoring tank. The abnormal motion of the carp means that the water is polluted with something. The operator needs to examine the carp motion to grasp whether the water quality is normal or not. VIDEO SEGMENT 2 does not show the carp in detail (e.g., the pattern of carp's bodies), but the global context which allows the operator to judge whether the carp are swimming normally or not.
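The relation between the displayed parameters and the bandwidth shares can be estimated with a short calculation. The sketch assumes that the full bandwidth carries one full-resolution image at 13 frames/second and that all windows have the same displayed size; because the frame rates shown in the title bars are rounded, the estimates only approximate the reported shares.

```python
FULL_RATE_FPS = 13.0  # full bandwidth: one full-resolution image at 13 fps


def bandwidth_share(fps, zoom_cols, zoom_rows, full_rate_fps=FULL_RATE_FPS):
    """Approximate fraction of the bandwidth used by one video image."""
    # A zoom rate of c x r sends roughly 1/(c*r) of the displayed pixels.
    return fps / (zoom_cols * zoom_rows) / full_rate_fps


focused_share = bandwidth_share(8, 1, 1)    # focused window, 1x1 at 8 fps
segment3_share = bandwidth_share(8, 2, 2)   # VIDEO SEGMENT 3, 2x2 at 8 fps
segment2_share = bandwidth_share(6, 3, 3)   # VIDEO SEGMENT 2, 3x3 at 6 fps
```

The three estimates come out close to the reported 59 %, 16 % and 5 %, with the small differences attributable to rounding of the displayed frame rates.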

This example shows that the UCV allows an operator to obtain a smoother video image of interest with a global context of video images.

Example for ISDN-64

We also simulated video transmission through the ISDN-64 network. The ISDN-64 network has 2 data lines (64 Kbits/second each) and 1 control line (16 Kbits/second). Both data lines can be used to transmit video images, so the network bandwidth is 128 Kbits/second. We assumed the bandwidth of the prototype system was 1.9 Mbits/second, which can transmit a video image (320 columns and 240 rows, 24 bits/pixel) at 1.1 frames/second. The views of this prototype system correspond to views of video images transmitted with JPEG through the ISDN-64 network.

COLOR PLATE 2

YAMAASHI_PLATE 2 shows a view of video images with the UCV. YAMAASHI_PLATE 3 shows a magnified image of the focused video image. An operator can understand the global motion of each video image, but the network bandwidth is too low to examine the focused video image in detail. The zoom rate of the focused video image is 3x3. The operator cannot read the numbers on the water gauge to check the water level.

COLOR PLATE 3

When the operator specifies the region of the water gauge in YAMAASHI_PLATE 3 as a region of interest, the UCV shows the video image of YAMAASHI_PLATE 4. The specified region of interest is marked with the red rectangle. The UCV shows the region of interest (80 columns and 45 rows) at full resolution (zoom rate 1x1), while the resolution of the background is 1.8 times rougher than in YAMAASHI_PLATE 3. In YAMAASHI_PLATE 4 the operator can read the numbers on the water gauge. This example shows that the UCV allows an operator to read numbers on a gauge by specifying the region as a region of interest, while the numbers cannot be read without the UCV.

COLOR PLATE 4

CONCLUSION

We have proposed User-Centered Video (UCV), a technique that transmits multiple video images through a narrow band network. The UCV assigns network data rates to video images and regions of a video image according to the user's interest. The UCV evaluates the degree of the user's interest from window layouts (distance from the focused window) and direct specification of regions of interest in a video image.

We demonstrated the UCV using examples that simulated views of multiple video images with ISDN-1500 and ISDN-64 networks. The example for the ISDN-1500 network demonstrated that the UCV allows a user to get the focused video with much smoother motion than without the UCV and the global context of video images is obtained even while the remaining video images are degraded. The example for the ISDN-64 network illustrated that the UCV allows a user to look at the details of regions of interest by degrading the remaining regions, although the user cannot look at the details without the UCV.

We assigned network data rates to video images according to the user's interest. In the future, we would also like to assign computing resources according to the user's interest. For example, the graphic power of three dimensional (3D) graphic hardware is a limited computing resource. When a user displays multiple 3D graphics, the required graphic power can exceed the power of the graphic hardware. It is desirable to assign the graphic power according to the user's interest.

References

1. Liou, M. Overview of the px64 kbits/s video coding standard, Communications of the ACM, 1991, ACM Press, Vol. 34, No. 4, pp. 59-63.
2. Wallace, G.K. The JPEG still picture compression standard, Communications of the ACM, 1991, ACM Press, Vol. 34, No. 4, pp. 30-44.
3. Le Gall, D. MPEG: A video compression standard for multimedia applications, Communications of the ACM, 1991, ACM Press, Vol. 34, No. 4, pp. 46-58.
4. Furnas, G.W. Generalized fisheye views, in Proc. CHI'86 Human Factors in Computing Systems (Boston, April 13-17, 1986), ACM Press, pp. 16-23.
5. Mackinlay, J.D., Robertson, G.G., Card, S.K. The perspective wall: detail and context smoothly integrated, in Proc. CHI'91 Human Factors in Computing Systems (New Orleans, April 27-May 2, 1991), ACM Press, pp. 173-179.
6. Stone, M.C., Fishkin, K., Bier, E.A. The movable filter as a user interface tool, in Proc. CHI'94 Human Factors in Computing Systems (Boston, April 24-28, 1994), ACM Press, pp. 306-312.
7. Hill Jr., F.S., Walker Jr., S., Gao, F. Interactive image query system using progressive transmission, Computer Graphics (July 1983), ACM Press, Vol. 17, No. 3, pp. 323-330.
8. Plompen, R.H.J.M., Groenveld, J.G.P., Booman, F., Boekee, D.E. An image knowledge based video codec for low bitrates, SPIE, Advances in Image Processing, 1987, Vol. 804, pp. 379-384.