Model-Based Visual Communication

Introduction

The concept behind model-based video coding (video compression) is that models of 3D objects require less information to transmit than images of those objects. For instance, a 3D computer graphics program depicting a moving ball (sphere) would, when starting up, set the properties of the ball, such as its radius, color, and the shininess of its surface. After that, the motion of the ball is specified by the 3D coordinates, x, y, and z, of the center of the ball at each instant in time. The data rate required to transmit the motion (the center of the ball), assuming 10 frames/sec, is 480 bits/sec (16 bits/value × 3 values/frame × 10 frames/sec). More sophisticated coding of the motion parameters would reduce the bit rate further. In comparison, if the sequence of images depicting the ball is coded using a conventional compression scheme like MPEG-1, about 1,200,000 bits/sec are required for reasonable picture quality at the same frame rate. While this is, of course, an extreme example, it illustrates the large gains in coding efficiency that model-based techniques can offer.
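The arithmetic in this example can be checked with a short Python sketch. The figures come from the paragraph above; the function and variable names are illustrative only.

```python
# Bit-rate arithmetic for the moving-ball example.
# All numeric values are taken from the text; the names are illustrative.

def motion_bit_rate(bits_per_value: int, values_per_frame: int, frames_per_sec: int) -> int:
    """Bits per second needed to send the motion parameters."""
    return bits_per_value * values_per_frame * frames_per_sec

# x, y, z of the ball's center, 16 bits each, at 10 frames/sec
model_rate = motion_bit_rate(bits_per_value=16, values_per_frame=3, frames_per_sec=10)

# Approximate MPEG-1 rate quoted in the text for the same frame rate
mpeg1_rate = 1_200_000

ratio = mpeg1_rate / model_rate

print(f"model-based: {model_rate} bits/sec")
print(f"MPEG-1:      {mpeg1_rate} bits/sec")
print(f"ratio:       {ratio:.0f} to 1")
```

With these figures the model-based rate is 480 bits/sec, a factor of 2,500 below the quoted MPEG-1 rate.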

Figure 1.  A computer graphics representation of a face as a 3D triangular mesh drawn:  (a) as a wireframe (each line is an edge of a triangle);  (b) as solid shapes with shading added;  (c) with texture mapping overlaid.

A model-based video coding system uses models of the 3D objects appearing in a video. Such a model of a human face is depicted in Fig. 1. Values for the parameters of these models and estimates of the 3D motion of modeled objects are obtained by analyzing the video. These parameter values and motion estimates are transmitted and a video display of the modeled objects and their motion is synthesized using 3D computer graphics.
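As a rough illustration of the transmit step, the following Python sketch packs one frame of motion parameters for the moving-ball example into 16-bit values and unpacks it at the receiver. The fixed-point scale, the byte order, and the function names are assumptions made for this sketch; they do not describe the coding actually used in this work, and the analysis (estimation) and synthesis (rendering) steps are omitted.

```python
import struct

# Assumed fixed-point scale: one integer step equals 0.01 scene units.
# This quantization is hypothetical and chosen only for illustration.
SCALE = 100

def encode_frame(x: float, y: float, z: float) -> bytes:
    """Quantize the center coordinates and pack them as three signed 16-bit integers."""
    quantized = [round(v * SCALE) for v in (x, y, z)]
    return struct.pack("<3h", *quantized)  # little-endian, 3 x int16

def decode_frame(payload: bytes) -> tuple:
    """Recover the (x, y, z) coordinates at the receiver."""
    return tuple(v / SCALE for v in struct.unpack("<3h", payload))

payload = encode_frame(1.25, -0.5, 3.0)
bits_per_frame = len(payload) * 8   # 3 values x 16 bits
decoded = decode_frame(payload)

print(bits_per_frame, decoded)
```

Each frame occupies 48 bits, which at 10 frames/sec gives the 480 bits/sec figure from the introduction. With this scale, a signed 16-bit value covers coordinates from about -327 to +327 scene units.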

The major advantages of model-based video coding are the low transmission and storage costs (because of the low bit rate) and the ability to use existing communications channels such as POTS (plain old telephone service), wireless (e.g., personal communications systems), and the internet. Many existing communications channels have low bandwidth (few bits/sec) and visual communication through these channels will be possible only if major gains in coding efficiency, such as those potentially achievable through model-based techniques, are obtained.

Another advantage of model-based coding is that the model used at the receiver (decoder) need not be the same as the model used at the transmitter (encoder). Hence, a person can appear differently at the receiver than at the transmitter. For example, one might appear nicely dressed and groomed at the receiver while actually being unkempt. This feature addresses a common objection to visual communication.

Model-based coding also has the advantage that the bit rate is nearly independent of the image resolution and quality. In other words, the number of model and motion parameter values (and the bit rate) does not change if the size of the computer-generated image at the receiver is, for example, increased from 320 × 240 to 640 × 480. In other video compression techniques, such as MPEG, the bit rate can increase significantly if the image resolution is increased.
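The following Python sketch puts numbers on this point, assuming the three 16-bit motion values from the moving-ball example as the model parameters. It compares only the raw pixel count per frame with the parameter payload per frame; it does not model how the bit rate of MPEG or any other coder actually scales with resolution.

```python
# Pixel count grows with display resolution; the parameter payload does not.
# The parameter count and bit width are assumptions carried over from the
# moving-ball example, not figures for a face model.

def pixels_per_frame(width: int, height: int) -> int:
    return width * height

def model_bits_per_frame(num_params: int, bits_per_value: int = 16) -> int:
    # Note: this does not depend on the display resolution at all.
    return num_params * bits_per_value

small = pixels_per_frame(320, 240)
large = pixels_per_frame(640, 480)
pixel_growth = large / small

model_bits = model_bits_per_frame(num_params=3)

print(f"pixels per frame: {small} -> {large} ({pixel_growth:.0f}x)")
print(f"model bits per frame at either resolution: {model_bits}")
```

Doubling both image dimensions quadruples the number of pixels per frame, while the parameter payload stays at 48 bits per frame.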

Model-based coding also raises several important issues. The computer-generated images must look realistic enough for public acceptance. The encoding and decoding must be accomplished with inexpensive equipment (cameras and computers) and must be done in real time (at least 10 frames/sec with QCIF images, which are 176 × 144 pixels). A significant technical challenge is that much of the third dimension (depth) is lost in a video, which is a sequence of 2D images, and this depth information is difficult to reconstruct.

Much of this work concentrates on model-based coding of heads and faces because such techniques are most relevant for videotelephones and teleconferencing. This focus brings its own difficulty: 3D computer graphics techniques are most often used to represent man-made objects, so developing a 3D model of a natural object like the human head and face is a difficult task.

This work also has applications in advanced human-computer interaction and speech recognition. For instance, these methods can synthesize a "talking head" that speaks to and interacts with a computer user in computer games and multimedia programs. For speech recognition, the lip motion of a person can be observed in a video and combined with conventional speech recognition to improve recognition accuracy.

Here are some results.

Publications and Patents Filed