ACM Multimedia 95 - Electronic Proceedings
November 5-9, 1995
San Francisco, California

MPI-Video Prototype Systems

Patrick H. Kelly
Visual Computing Laboratory
University of California, San Diego
9500 Gilman Drive, Mail Code 0407
La Jolla, CA 92093-0407
phkelly@vision.ucsd.edu
http://vision.ucsd.edu/~phkelly/

Arun Katkere
Visual Computing Laboratory
University of California, San Diego
9500 Gilman Drive, Mail Code 0407
La Jolla, CA 92093-0407
katkere@vision.ucsd.edu
http://vision.ucsd.edu/~katkere/

Don Y. Kuramura
Visual Computing Laboratory
University of California, San Diego
9500 Gilman Drive, Mail Code 0407
La Jolla, CA 92093-0407
kuramura@vision.ucsd.edu
http://vision.ucsd.edu/~kuramura/

Saied Moezzi
Visual Computing Laboratory
University of California, San Diego
9500 Gilman Drive, Mail Code 0407
La Jolla, CA 92093-0407
moezzi@vision.ucsd.edu
http://vision.ucsd.edu/~moezzi/

Shankar Chatterjee
Visual Computing Laboratory
University of California, San Diego
9500 Gilman Drive, Mail Code 0407
La Jolla, CA 92093-0407
shankar@vision.ucsd.edu
http://vision.ucsd.edu/~shankar/

Ramesh Jain
Visual Computing Laboratory
University of California, San Diego
9500 Gilman Drive, Mail Code 0407
La Jolla, CA 92093-0407
jain@vision.ucsd.edu
http://vision.ucsd.edu/~jain/



Abstract:

Management of video information is a demanding but increasingly prevalent aspect of multimedia computing systems. Because video data is both voluminous and highly useful, current and next-generation multimedia systems are being called upon to perform ever more demanding services on it, such as automatic and semi-automatic analysis and annotation. Our Multiple Perspective Interactive Video (MPI-Video) project integrates a variety of visual computing operations with modeling and interaction techniques to extract, synthesize, and manage dynamic representations of scenes observed from multiple perspectives. Video data from multiple cameras is analyzed automatically and semi-automatically, and the results are used to build a three-dimensional model of the environment monitored by the cameras. MPI-Video has applications in a variety of areas, including immersive video for telepresence systems, traffic monitoring and control, and the analysis of physical performances such as sports and dance. This video presents an overview of MPI-Video along with demonstrations of two MPI-Video prototypes developed in the Visual Computing Lab.


Video Summary

Multiple Perspective Interactive Video (MPI-Video) is an infrastructure for the analysis and management of multiple streams of video data. Typically, these video streams come from a set of cameras simultaneously monitoring an environment. Our MPI-Video architecture consists of several components. Video streams are input to the Video Data Analyzer, which performs image processing operations to identify and track objects recorded by the video. An Information Assimilator module collects the results of the analysis performed on each video stream and integrates them into a comprehensive representation of the three-dimensional environment monitored by the cameras. This representation is encapsulated in the Environment Model. The system also provides a multimedia database, which is currently being developed in our lab. A View Selector module selects a ``best view,'' defined below, from the available video frames. The Visualizer and Virtual View Builder takes the video data, along with information from the Environment Model, and uses these to create virtual views of the environment. As with the concept of ``best view,'' virtual views are described shortly. A user interface supports selection and query activities. Further details on our MPI-Video architecture, including an architecture diagram, are provided by Kelly et al. [2].
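
As a rough illustration (in Python) of how these components fit together, the sketch below strings them into a per-frame pipeline. The class and method names are hypothetical placeholders for the components named above, not the system's actual interfaces.

    # Hypothetical sketch of per-frame MPI-Video data flow; names are illustrative only.
    class MPIVideoPipeline:
        def __init__(self, analyzers, assimilator, environment_model,
                     view_selector, virtual_view_builder):
            self.analyzers = analyzers                  # one Video Data Analyzer per camera
            self.assimilator = assimilator              # fuses per-camera results
            self.environment_model = environment_model  # 3-D scene representation
            self.view_selector = view_selector
            self.virtual_view_builder = virtual_view_builder

        def process_frame_set(self, frames):
            # frames: one image per camera, captured at the same time step
            observations = [a.detect_and_track(f)
                            for a, f in zip(self.analyzers, frames)]
            # Integrate 2-D observations into a 3-D description of the scene
            scene_update = self.assimilator.integrate(observations)
            self.environment_model.update(scene_update)
            # Choose the camera view that best satisfies the current criterion
            best = self.view_selector.select(self.environment_model, frames)
            # Optionally synthesize a view from a viewpoint no camera occupies
            virtual = self.virtual_view_builder.render(self.environment_model, frames)
            return best, virtual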

This video details two prototype systems which realize various aspects of this architecture. The first section of the video presents an MPI-Video prototype which automatically identifies and tracks objects moving in an environment recorded by four cameras. This system implements versions of the Video Data Analyzer, Assimilator, Environment Model, View Selector, and Virtual View Builder, along with a simplified user interface. Each camera ``sees'' the environment from a different perspective. In this prototype, image differencing techniques are used to identify and extract moving object data from the streams of video data. This data is collected by the centralized information assimilation module and used to construct a three-dimensional model of object behavior and events in the environment. An interface allows users to interact with the three-dimensional model and the video data. The interface presents both the raw video data, annotated to indicate moving objects in the scene, and the three-dimensional model itself. Note that users can view the model from a variety of perspectives, not limited to the real camera views. For instance, panoramic and ``bird's eye'' views are both available.
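
As a rough idea of the image-differencing step, the following Python sketch flags pixels that differ from a reference background frame and reports a bounding box around them. The threshold value and the absence of connected-component or noise-filtering steps are simplifying assumptions; this is not the prototype's actual code.

    import numpy as np

    def moving_object_mask(frame, background, threshold=25):
        """Flag pixels that differ noticeably from a reference background frame.

        frame, background: grayscale images as 2-D uint8 NumPy arrays.
        Returns a boolean mask that is True where motion is detected.
        """
        diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
        return diff > threshold

    def bounding_box(mask):
        """Smallest axis-aligned box containing all detected motion pixels."""
        rows = np.any(mask, axis=1)
        cols = np.any(mask, axis=0)
        if not rows.any():
            return None
        top, bottom = np.where(rows)[0][[0, -1]]
        left, right = np.where(cols)[0][[0, -1]]
        return (left, top, right, bottom)

    # Tiny synthetic example: a bright 3x3 patch appears in an empty scene.
    bg = np.zeros((10, 10), dtype=np.uint8)
    fr = bg.copy()
    fr[2:5, 4:7] = 200
    print(bounding_box(moving_object_mask(fr, bg)))   # -> (4, 2, 6, 4)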

In an alternate version of this interface, a viewer can select objects and areas in the scene using a cursor. A marker indicating the selection, as seen from each camera perspective, is placed in each camera window. This sequence also introduces the notion of a ``best view'': a particular camera view that is optimal with respect to some criterion specified by the system or a user. For instance, the ``best view'' might be the vista in which a particular object of interest appears largest. Here, the ``best view'' is the one for which the distance along the line of sight from the camera to the selected position is minimum. When a location in the model is selected, the system determines the line-of-sight distance for each of the four available views and chooses the view for which this distance is shortest. We are currently investigating more complex formulations of ``best view'' criteria.
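
In code terms, this distance criterion amounts to choosing the camera nearest the selected world point. A minimal Python sketch, assuming known camera positions in world coordinates (positions and units below are illustrative):

    import math

    def best_view(cameras, selected_point):
        """Return the index of the camera closest to the selected 3-D point.

        cameras: list of (x, y, z) camera positions in world coordinates.
        selected_point: (x, y, z) location picked by the user in the model.
        """
        return min(range(len(cameras)),
                   key=lambda i: math.dist(cameras[i], selected_point))

    # Example: four cameras around a courtyard; the user selects a point near camera 2.
    cams = [(0.0, 0.0, 3.0), (30.0, 0.0, 3.0), (30.0, 20.0, 3.0), (0.0, 20.0, 3.0)]
    print(best_view(cams, (27.0, 18.0, 0.0)))   # -> 2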

To perform the video analysis and modeling, our MPI-Video system uses information about the ``static'' world. Camera calibration relates locations in the two-dimensional video to a fully three-dimensional representation of the world recorded by the cameras. The video demonstrates software developed to assist in this calibration. This information is also maintained by the Environment Model and utilized by the Assimilator and other components.
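
As a simplified illustration of what such a calibration provides, the Python sketch below projects a three-dimensional world point into image coordinates using a 3x4 projection matrix. The matrix values are placeholders; the prototype's actual calibration procedure and parameters are not shown here.

    import numpy as np

    def project_to_image(P, world_point):
        """Project a 3-D world point into 2-D image coordinates.

        P: 3x4 camera projection matrix obtained from calibration.
        world_point: (X, Y, Z) in world coordinates.
        Returns (u, v) pixel coordinates.
        """
        X = np.append(np.asarray(world_point, dtype=float), 1.0)  # homogeneous coordinates
        u, v, w = P @ X
        return u / w, v / w

    # Placeholder projection matrix (illustrative values only).
    P = np.array([[800.0,   0.0, 320.0, 0.0],
                  [  0.0, 800.0, 240.0, 0.0],
                  [  0.0,   0.0,   1.0, 0.0]])
    print(project_to_image(P, (1.0, 0.5, 10.0)))   # -> (400.0, 280.0)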

The use of MPI-Video to support an Immersive Video application is also shown. In Immersive Video, visual information from multiple live video images of a real-world event is obtained and integrated to provide a photo-realistic rendition of the dynamic environment, supporting a sense of total immersion in the environment. Using the actual camera data, the Virtual View Builder module can construct a virtual view from a location anywhere in the environment. These views are created by mosaicing pixels from the video data. Several such ``virtual'' views are shown, and an animated walk-through of the courtyard environment illustrates a sequence of such views. (An MPEG-1 version of this walk-through is located at http://vision.ucsd.edu:80/papers.) Immersive Video is described in greater detail in our technical reports [3,4].
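
The sketch below gives a very rough picture of mosaicing pixels into a virtual view: each model surface point is projected into the desired virtual viewpoint and colored from the nearest real camera. Occlusion handling, resampling, and all data structures here are simplifying assumptions, not the Virtual View Builder's actual method.

    import numpy as np

    def render_virtual_view(surface_points, colors_per_camera, camera_centers,
                            virtual_P, image_size):
        """Simplified virtual-view rendering sketch.

        surface_points: (N, 3) array of 3-D points from the environment model.
        colors_per_camera: list of (N,) arrays, one per camera, giving each
                           camera's color sample for every model point.
        camera_centers: (C, 3) array of real camera positions.
        virtual_P: 3x4 projection matrix of the desired virtual viewpoint.
        Returns an (H, W) image mosaiced from the real cameras' pixels.
        """
        H, W = image_size
        image = np.zeros((H, W))
        homog = np.hstack([surface_points, np.ones((len(surface_points), 1))])
        proj = (virtual_P @ homog.T).T
        uv = proj[:, :2] / proj[:, 2:3]
        # For each model point, take the color seen by the nearest real camera.
        dists = np.linalg.norm(surface_points[:, None, :] - camera_centers[None, :, :], axis=2)
        nearest = np.argmin(dists, axis=1)
        for i, (u, v) in enumerate(uv):
            u, v = int(round(u)), int(round(v))
            if 0 <= u < W and 0 <= v < H:
                image[v, u] = colors_per_camera[nearest[i]][i]
        return image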

The second portion of the video highlights an early MPI-Video prototype, named Fumble. In this prototype, we analyze football data from three cameras covering a Super Bowl game. Here, the system tracks different players on the field and chooses a ``best view'' for a user-selected player. While the MPI-Video system described above provides fully automatic detection and tracking, this second system relies on manual analysis of the video data. Scene analysis was performed on key frames of the video, and the results of this analysis were stored in a database. The scene analysis consisted of identifying players and field marks in each frame. Such ``features'' are used by a camera calibration algorithm to provide an image-to-world coordinate mapping. During processing, information produced by the analysis is accessed by the system to perform tracking and best-view selection. At run time, an interpolative scheme is used to estimate player positions and camera parameters between key frames.
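
A minimal Python sketch of such key-frame interpolation, assuming linear interpolation of annotated two-dimensional field positions (the prototype's actual scheme and data layout may differ):

    def interpolate_position(t, key_frames):
        """Linearly interpolate a player's position between annotated key frames.

        t: query time.
        key_frames: list of (time, (x, y)) pairs sorted by time, produced
                    by the manual key-frame analysis.
        """
        for (t0, p0), (t1, p1) in zip(key_frames, key_frames[1:]):
            if t0 <= t <= t1:
                a = (t - t0) / (t1 - t0)
                return (p0[0] + a * (p1[0] - p0[0]),
                        p0[1] + a * (p1[1] - p0[1]))
        raise ValueError("time outside annotated range")

    # Example: player annotated at frames 0 and 30, position queried at frame 10.
    print(interpolate_position(10, [(0, (5.0, 12.0)), (30, (11.0, 18.0))]))  # -> (7.0, 14.0)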

As indicated in the video, an interface allows users to query about players in the system, and a three-dimensional cursor is used to select and locate players in the video data. The 3D cursor indicates a player's three-dimensional location in the scene. For instance, a user can place the cursor on a particular player in the video frame and ask ``Who is this?''; the system then identifies the player. Alternately, a specific player can be chosen from a list, and the system will indicate, using the three-dimensional cursor, where the player is in the frame. Like the previous system, Fumble also provides a ``best view'': in this case, the view that keeps a selected player of interest most central in the frame. A ``best view'' sequence illustrates this capability of the Fumble prototype: the system tracks the selected player of interest, choosing at each time step the view in which the player is most central. Additional information on our Fumble prototype is presented in the account of this work by Jain and Wakimoto [1].
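
A ``Who is this?'' query of this kind can be sketched as a nearest-neighbor lookup against the players' current world positions; the names and data layout below are illustrative assumptions, not Fumble's actual interface.

    import math

    def who_is_this(cursor_position, player_positions):
        """Answer a ``Who is this?'' query by finding the player nearest the cursor.

        cursor_position: (x, y, z) world location under the 3-D cursor.
        player_positions: dict mapping player name -> (x, y, z) world position.
        """
        return min(player_positions,
                   key=lambda name: math.dist(player_positions[name], cursor_position))

    # Hypothetical example with two players on the field.
    players = {"QB #8": (20.0, 26.5, 0.0), "WR #81": (35.0, 10.0, 0.0)}
    print(who_is_this((21.0, 25.0, 0.0), players))   # -> "QB #8"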


References

1
R. Jain and K. Wakimoto. Multiple perspective interactive video. In Proceedings of International Conference on Multimedia Computing and Systems, pages 202--211, May 1995.

2
P. Kelly, A. Katkere, D. Kuramura, S. Moezzi, S. Chatterjee, and R. Jain. An Architecture for Multiple Perspective Interactive Video. Technical Report VCL-95-103, Visual Computing Laboratory, UCSD, March 1995.

3
S. Moezzi, A. Katkere, S. Chatterjee, and R. Jain. Immersive Video. Technical Report VCL-95-104, Visual Computing Laboratory, UCSD, March 1995.

4
S. Moezzi, A. Katkere, S. Chatterjee, and R. Jain. Visual Reality: Rendition of Live Events from Multi-Perspective Videos. Technical Report VCL-95-102, Visual Computing Laboratory, UCSD, March 1995.