
Tutorials Program


T01 – Processing Web-Scale Multimedia Data

Malcolm Slaney, Edward Chang
Abstract: In the last few years we have gained access to multimedia databases containing billions of objects. This massive change in the amount of data available to researchers is changing the face of multimedia. In many domains, most notably speech recognition, people have observed that the best way to improve an algorithm’s performance is to add more data. Starting with hidden Markov models (HMMs) and support vector machines, people have applied ever greater amounts of data to their problems and been rewarded with new levels of performance.

Details: T01 – Processing Web-Scale Multimedia Data

Duration: half day

T02 – Advances in Multimedia Retrieval

Alan Hanjalic, Martha Larson, Cees Snoek, Arnold Smeulders
Abstract: Multimedia that cannot be found is, in a certain sense, useless. It is lost in a huge collection, or worse, in a back alley of the Internet, never viewed and impossible to reuse. Research in multimedia retrieval is directed at developing techniques that bring video together with users – matching multimedia content and user needs. The aim of this tutorial is to provide insights into the most recent developments in the field of multimedia retrieval and to identify the issues and bottlenecks that could determine the directions of research focus for the coming years. This tutorial targets new scientists in the field of multimedia retrieval, providing instruction on how to best approach the multimedia retrieval problem and examples of promising research directions to work on. It is also designed to benefit active multimedia retrieval scientists — those who are searching for new challenges or re-orientation. The material covered is relevant for participants from both academia and industry. It covers issues pertaining to the development of modern multimedia retrieval systems and highlights emerging challenges and techniques anticipated to be important for the future of multimedia retrieval.

Part 1: Frontiers in Multimedia Search (Alan Hanjalic, Martha Larson)

In this part of the tutorial we present a whirlwind tour of hot new multimedia search techniques, covering strategies that exploit users, the collection as a whole, and the analysis of individual content items. We focus on improving the overall usefulness of search systems. Discussion of information retrieval and speech/language-based techniques is included.

Part 2: Video Search Engines (Cees Snoek, Arnold Smeulders)

In this part of the tutorial we focus on the challenges in video search, present methods for achieving state-of-the-art performance, and indicate how to obtain improvements in the near future. Moreover, we give an overview of the latest developments and future trends in the field on the basis of the TRECVID competition – the leading competition for video search engines, run by NIST – where we have achieved consistent top performance over the years, including the 2008 and 2009 editions.

Details: T02 – Advances in Multimedia Retrieval

Duration: full day

T03 – Understanding Multimedia Content Using Web Scale Social Media Data

Dong Xu, Lei Zhang, Jiebo Luo
Abstract: Nowadays, increasingly rich and massive social media data (text, images, audio, video, blogs, and so on) are being posted to the web, including social networking websites (e.g., MySpace, Facebook), photo and video sharing websites (e.g., Flickr, YouTube), and photo forums (e.g., Photosig.com and Photo.net). Recently, researchers from multiple disciplines have proposed data-driven approaches to multimedia content understanding that leverage these virtually unlimited web images and videos as well as their associated rich contextual information (e.g., tags, comments, categories, titles, and metadata). In our three-hour tutorial, we plan to introduce the important general concepts and themes of this timely topic. We will also review and summarize recent multimedia content analysis methods that use web-scale social media data, and present insight into the challenges and future directions in this area. Moreover, we will show extensive demos of image annotation and retrieval using rich social media data.

Details: T03 – Understanding Multimedia Content Using Web Scale Social Media Data

Duration: half day

T04 – The Role of Media Semiotics to Facilitate Media Understanding

Frank Nack
Abstract: One current goal of multimedia research is to make multimedia information pervasively accessible and usable. An essential question on the way to that goal is under which circumstances a particular medium, or a mix of media, serves a particular communication need more effectively than others. This question is particularly relevant for research in social media and mobile media, as both establish contexts that allow, for the first time, social behavior to be observed on a large scale and in real time.

One promising approach to interpreting these data, and thus to building the tools needed to support communication between humans, as well as between humans and their living environment, more effectively, is grounded in understanding the semantics of various media through a computationally informed and systematic study of media production and reception. The purpose of this tutorial is to provide an understanding of the role and applicability of semiotics in modelling media semantics for various contexts. The tutorial addresses a number of specific issues:

  • Basic communication and semiotic theory
  • The applicability of media semiotics to capturing, representing, processing, managing, and personalizing media

Details: T04 – The Role of Media Semiotics to Facilitate Media Understanding

Duration: half day

T05 – Mobile Video Streaming in Modern Wireless Networks

Mohamed Hefeeda, Cheng-Hsin Hsu
Abstract: Modern mobile devices, such as laptops, PDAs (Personal Digital Assistants), smart phones, and PMPs (Portable Media Players), have evolved into powerful mobile computers that can render rich multimedia content. More and more users watch videos streamed over wireless networks on these devices, and they demand more content at better quality. For example, market forecasts suggest that mobile video streaming, such as mobile TV, will catch up with gaming and music and become the most popular application on mobile devices, with more than 140 million subscribers worldwide by 2011. In this tutorial, we will present different approaches to delivering multimedia content over various wireless networks to many mobile users. We will study and analyze the main research problems in modern wireless networks that need to be addressed in order to enable efficient mobile multimedia services. The tutorial will cover common research problems in wireless networks such as HSDPA (High-Speed Downlink Packet Access), the MBMS (Multimedia Broadcast Multicast Services) extension of cellular networks, WiMAX (Worldwide Interoperability for Microwave Access), LTE (Long Term Evolution), DVB-H (Digital Video Broadcasting – Handheld), and ATSC M/H (Advanced Television Systems Committee – Mobile/Handheld). After giving the preliminaries of the considered wireless network standards, we will focus on several important research problems and present their solutions in detail. These research problems include: (i) maximizing the energy savings of mobile receivers, (ii) maximizing the bandwidth utilization of wireless networks, (iii) minimizing stream switching time, and (iv) supporting heterogeneous mobile receivers. Finally, we will discuss open problems and future research directions in mobile multimedia.

Details: T05 – Mobile Video Streaming in Modern Wireless Networks

Duration: half day

T06 – Locality aware P2P delivery: the way to scale Internet Video

Jin Li
Abstract: The market for Internet video began its dramatic acceleration in 2006. The cumulative number of broadband-enabled video devices was 160 million in 2009 and is projected to grow to almost a billion by 2014. Even social networking, which is seeing very strong growth with Facebook and other sites, probably cannot compare to the sheer numbers of new consumers signing on to broadband video. Moreover, a large portion of the videos viewed will be ad-supported or free. Thus, there is a need to distribute large amounts of video cheaply to end users. Existing data centers and CDN providers do not have the capacity or the cost structure to handle the surging demand for Internet video. In comparison, a peer-assisted (P2P-CDN) solution uses the resources of the peers as they join the service: as demand on the system grows, so does the capacity of the network. By building locality awareness into the peer-assisted solution, we can retrieve popular content from close-by peers, thus relieving congestion on the Internet backbone. The locality-aware peer-assisted solution is the only way to scale Internet video to a worldwide audience. The purpose of the tutorial is to examine the issues associated with successfully building and deploying an efficient and reliable locality-aware peer-assisted content delivery solution. We start by examining some popular P2P applications, such as BitTorrent, Skype, and PPLive; studying these applications helps us understand the design principles of P2P applications in general. We then investigate the existing Internet backbone components, such as data centers, CDN providers, and the Internet architecture itself, and see how a P2P network may effectively complement the data centers and CDN providers.
Finally, we will examine a number of tools for building an efficient and reliable P2P application: the overlay network, the scheduling algorithm, erasure-resilient coding, P2P economics, security issues, and performance monitoring and debugging utilities.
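The locality-aware retrieval idea can be illustrated with a toy peer-selection routine. The peer records, the AS-based distance heuristic, and all numbers below are hypothetical illustrations, not part of any deployed P2P-CDN:

```python
# Toy locality-aware peer selection: prefer peers in the same ISP/AS, then
# lower measured round-trip time, so popular content is fetched from nearby
# peers instead of crossing the Internet backbone.

def select_peers(candidates, local_asn, k=3):
    """Rank candidate peers: same-AS peers first, then by RTT (ms)."""
    def cost(peer):
        same_as = 0 if peer["asn"] == local_asn else 1
        return (same_as, peer["rtt_ms"])
    return sorted(candidates, key=cost)[:k]

peers = [
    {"id": "a", "asn": 7018, "rtt_ms": 80},
    {"id": "b", "asn": 3320, "rtt_ms": 15},   # nearby, but a different ISP
    {"id": "c", "asn": 7018, "rtt_ms": 40},   # same ISP as the requester
    {"id": "d", "asn": 7018, "rtt_ms": 25},
]
best = select_peers(peers, local_asn=7018, k=2)
print([p["id"] for p in best])  # → ['d', 'c']
```

Real systems combine many more signals (ISP-provided topology hints, measured throughput, peer churn), but the rank-by-locality pattern is the same.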

Details: T06 – Locality aware P2P delivery: the way to scale Internet Video

Duration: half day

T07 – A Systematic Framework for Cross-layer Optimization in Dynamic Multimedia Networks and Systems

Mihaela van der Schaar, Fangwen Fu
Abstract: Cross-layer optimization problems in dynamic multimedia systems tend to be very complex, since they require the simultaneous optimization of a large number of algorithms and parameters within various layers of the protocol stack, across multiple users. Specifically, they need to explicitly consider time-varying channel, network, and application characteristics as well as dynamic multi-user interaction. Most existing solutions for cross-layer optimization in multimedia networks and systems rely on heuristic procedures. However, to obtain optimal utility for the supported multimedia applications, cross-layer optimization should be formulated rigorously as a sequential decision problem that takes into account the ability of the various layers to autonomously forecast their own locally experienced dynamics and perform foresighted adaptation, without violating the current layered architecture of the protocol stack. More importantly, dynamic multimedia systems are often resource-constrained (e.g. limited network resources, energy budgets, computational resources, etc.). In this tutorial, we present a unified foresighted cross-layer optimization framework that explicitly considers both the heterogeneity of the multimedia traffic and the dynamics of the wireless networks when optimizing the long-term utilities of the multimedia applications. The proposed framework allows multimedia systems to efficiently utilize limited resources by performing foresighted optimization. We also establish four separation principles that are essential for designing simple yet optimal multimedia communication systems.
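To make the "sequential decision problem" framing concrete, here is a minimal finite-horizon example, invented for illustration and much simpler than the tutorial's actual models: one packet must be sent within a deadline over a two-state wireless channel, and a foresighted policy computed by backward induction beats the myopic "send immediately" rule. All rewards and transition probabilities are made up:

```python
# Minimal finite-horizon MDP illustrating foresighted adaptation: one packet
# must be sent within `horizon` slots; sending in a good channel yields more
# utility, and an unsent packet at the deadline yields nothing.

SEND_REWARD = {"good": 3.0, "bad": 1.0}   # utility of transmitting now
P_GOOD_NEXT = {"good": 0.7, "bad": 0.6}   # Pr[channel is good in next slot]

def foresighted_value(horizon):
    """Optimal (foresighted) expected utility, per current channel state."""
    v = {"good": 0.0, "bad": 0.0}         # 0 slots left: deadline missed
    for _ in range(horizon):
        nxt = {}
        for ch in ("good", "bad"):
            send = SEND_REWARD[ch]
            p = P_GOOD_NEXT[ch]
            wait = p * v["good"] + (1 - p) * v["bad"]
            nxt[ch] = max(send, wait)     # foresight: weigh future channels
        v = nxt
    return v

v = foresighted_value(3)
# Myopically sending in a bad channel yields 1.0; planning ahead does better:
print(round(v["bad"], 4))  # → 2.68
```

The gap between 1.0 and 2.68 is exactly what "foresighted" buys here: with slots to spare, waiting for a likely-good channel dominates transmitting immediately.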

Details: T07 – A Systematic Framework for Cross-layer Optimization in Dynamic Multimedia Networks and Systems

Duration: half day

T08 – Immersive Future Media Technologies

Christian Timmerer, Karsten Müller
Abstract: The past decade has witnessed a significant increase in research efforts around Quality of Experience (QoE), generally understood as a human-centric counterpart to Quality of Service (QoS): the quality of a service as perceived by the end user. As it puts the end user at center stage, QoE has various dimensions; one that has recently gained momentum is 3D video, and another aims to go beyond 3D and promises advanced user experience through sensory effects. Both are introduced briefly in the following.

3D Video: Stereo and Multi-View Video Technology: 3D-related media technologies have recently developed from purely research-oriented work towards applications and products. 3D content is now being produced on a wider scale, and the first 3D applications have been standardized, such as multi-view video coding for 3D Blu-ray discs. This part of the tutorial starts with an overview of 3D in the form of stereo-video-based systems, which are currently being commercialized. Here, stereo formats and the associated coding are introduced. This technology is used in 3D cinema applications and mobile 3DTV environments; for the latter, user requirements and profiling are introduced as a way to assess user quality of experience. For 3D home entertainment, glasses-free multi-view displays are required, as more than one user will watch the 3D content, and for such displays the current stereo solutions need to be extended. Therefore, new activities in 3D video are introduced. These will develop a generic 3D video format with color and supplementary geometry data, e.g. depth maps, and associated coding and rendering technology for any multi-view display, independent of the number of views. As such technology is also developed in international consortia, the most prominent of these, such as the 3D@HOME consortium, the EU 3D, Immersive, Interactive Media Cluster, and the 3D video activities in ISO/MPEG, are introduced.
Advanced User Experience through Sensory Effects: This part of the tutorial addresses a novel approach to enriching the user experience, beyond 3D, through sensory effects. The motivation behind this work is that the consumption of multimedia assets may also stimulate senses other than vision or audition, e.g., olfaction, mechanoreception, equilibrioception, or thermoception, leading to an enhanced, unique user experience. This can be achieved by annotating media resources with metadata (currently being defined by ISO/MPEG as part of the MPEG-V standard) describing so-called sensory effects, which steer devices capable of rendering them (e.g., fans, vibration chairs, ambient lights, perfume dispensers, water sprayers, fog machines, etc.). In particular, we will review the concepts and details of the forthcoming MPEG-V standard and present our prototype architecture for the generation, transport, decoding, and use of sensory effects. Furthermore, we will present the details and results of a series of formal subjective quality assessments, which confirm that sensory effects are a vital tool for enhancing the user experience.

Details: T08 – Immersive Future Media Technologies

Duration: half day

T09 – Modeling Human Behavior with Mobile Phones

Daniel Gatica-Perez
Abstract: In just a few years, mobile phones have emerged as the ultimate multimedia device. Current smartphones allow us to take pictures, listen to music, watch videos, interact with the physical world through GPS, communicate via calls, SMS, or MMS, and browse the web. Given their ubiquity, mobile phones have become the most natural device for multimedia consumption, production, and interaction, but there is much more to them. Mobile phones can constantly sense people’s location via GPS or cell tower connectivity, motion through accelerometers, proximity via Bluetooth, and communication through call and SMS logs, and thus represent the most accurate and nonintrusive current means of tracing real-life human activities. Furthermore, all this information, as never before, is being generated at massive scale. It is therefore not surprising that the understanding of personal and social behavior from mobile sensor data at large scale, where populations of hundreds or thousands of cell phone users are analyzed as individuals or as groups over possibly long periods of time, has emerged as a frontier domain in computing and in social science. This domain has also attracted attention from the media: the concept of Reality Mining, coined at MIT, was identified as one of the 10 technologies “most likely to change the way we live” by Technology Review magazine in 2008, and has been featured in mainstream media like Newsweek and The Economist and in scientific media like Nature.

Details: T09 – Modeling Human Behavior with Mobile Phones

Duration: half day

T10 – Human-Centered Multimedia Systems

Nicu Sebe, Alejandro (Alex) Jaimes, Hamid Aghajan
Abstract: This tutorial will focus on technical analysis and interaction techniques formulated from the perspective of key human factors, in a user-centered approach to developing multimedia systems. The tutorial takes a holistic view of the research issues and applications of human-centered systems, focusing on three main areas: (1) multimodal interaction: visual (body, gaze, gesture) and audio (emotion) analysis; (2) image indexing and retrieval: user behavior, context modeling, cultural issues, and machine learning for user-centric approaches; and (3) multimedia data: conceptual analysis at different levels (feature, cognitive, and affective). This full-day tutorial consists of two parts: the first half will consist of presentations by the instructors, and the second half of practical workgroup activities.

Details: T10 – Human-Centered Multimedia Systems

Duration: full day

T11 – Multimodal Sensing for Healthcare, Sports, and Entertainment

Balakrishnan Prabhakaran
Abstract: Imagine these questions: Would it not be great to monitor the whereabouts of a near and dear one afflicted with an illness such as Parkinson’s or Alzheimer’s (which may cause them to get lost)? Would it not be perfect if you could exercise exactly the right sets of muscles and improve your sports performance, or know which subtle action you are getting wrong that is sending your golf swing totally off target? Would it not be nice if we could “walk into” a video game and “punch the monster” without using any joystick or remote controller, or play a music title by waving a hand at the computer screen? Answering these questions typically requires sensing the motions of the human body’s joints. With advances in camera technology, 3D cameras now also provide depth information, the “z-pixels”. More advanced techniques use data from 3D cameras to better sense human gestures and respond to them. For instance, ZCam [6] has sensors that measure the depth of each captured pixel using a principle called Time-of-Flight: it obtains 3D information “by emitting pulses of infra-red light to all objects in the scene and sensing the reflected light from the surface of each object.” The objects in the scene are then ordered in layers along the Z axis, which yields a grayscale depth map that a game or any other software application can use. In terms of depth resolution, it can detect 3D motion and volume down to 0.4 inches, while capturing full-color, 1.3-megapixel video at 60 frames per second. In a similar manner, advances in Body Sensor Network (BSN) technology provide several ways in which human gestures, and in some cases intentions, can be recognized. Examples of such devices include accelerometers (for tracking human motion), electromyograms (EMG, for measuring muscular activity), and electroencephalograms (EEG, for monitoring brain activity).
As one can easily imagine, more than one body sensor, as well as video, can be used simultaneously. Such multi-modal sensing can improve ease of use as well as the accuracy and efficiency of human motion/gesture recognition. Data from these sensors are typically time series, and the data from multiple sensors form multiple multidimensional time series. Analyzing data from multiple such medical sensors poses several challenges: different sensors have different characteristics, different people generate different patterns through these sensors, and even for the same person the data can vary widely with time and environment. BSN data has several similarities to other multimedia data: it may have both discrete and continuous components, with or without real-time requirements; it can be voluminous; and continuous BSN data may need signal processing techniques for recognition and interpretation. In this tutorial, we discuss the state of the art in video-based and sensor-based human gesture recognition. We dwell on the similarities and differences between the two approaches and evaluate the algorithms and techniques that can be employed, focusing primarily on algorithms that can run in real time. We also discuss approaches for classification, data mining, visualization, and securing these data, and show several demonstrations of body sensor networks as well as software that aids in analyzing the data.
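The Time-of-Flight principle quoted above reduces to simple arithmetic: depth is half the round-trip time of the infra-red pulse multiplied by the speed of light. A minimal sketch, with made-up per-pixel timings (no relation to ZCam's actual sensor readout):

```python
# Time-of-flight depth: the sensor measures the round-trip time of a light
# pulse per pixel; depth = c * t / 2, since the pulse travels out and back.

C = 299_792_458.0  # speed of light, m/s

def depth_from_round_trip(t_seconds):
    """Depth in metres from a round-trip pulse time."""
    return C * t_seconds / 2.0

# A 2x2 "depth map" from hypothetical per-pixel round-trip times (ns):
round_trip_ns = [[4.0, 4.0],
                 [6.0, 20.0]]
depth_m = [[round(depth_from_round_trip(t * 1e-9), 3) for t in row]
           for row in round_trip_ns]
print(depth_m)  # → [[0.6, 0.6], [0.899, 2.998]]
```

The sub-centimetre resolution quoted in the abstract corresponds to resolving round-trip time differences of well under a tenth of a nanosecond, which is why ToF sensors rely on specialized modulation rather than naive timing.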

Details: T11 – Multimodal Sensing for Healthcare, Sports, and Entertainment

Duration: half day

T12 – Designing and Optimizing Large-Scale Multimedia Mining Applications in Distributed Processing Environments

Deepak S. Turaga, Mihaela van der Schaar
Abstract: The number and modalities of digital information sources being captured to monitor our traffic, weather, goods, factories, utilities, ports, health, and IT infrastructure continue to grow at an incredible rate. This has led to great interest in large-scale multimedia processing, analysis, and systems for a wide variety of mining and knowledge extraction applications. The research involved in developing such large-scale applications on distributed systems lies at the intersection of several diverse disciplines, including signal processing, multi-modal multimedia mining and analytics, multimedia streaming systems, large-scale distributed stream processing systems, user interfaces, and sensor networks. The associated research problems range from designing novel applications and mining algorithms to system issues of resource adaptation, flow control, and reliability, and the intersection of these. Additionally, the interactions emerging among inter-connected multimedia stream mining entities can be modeled as a game in which these entities strategically interact with each other to compete for dynamically available network or system resources and improve the achievable performance of the application. In this tutorial, we will present the fundamental principles of large-scale adaptive multimedia stream mining, describe the state of the art in systems and algorithms, and include recent theoretical and experimental results. We will also discuss how to construct different cooperative and non-cooperative games to model, analyze, optimize, and shape these applications in different system or connectivity scenarios and under various constraints.

Details: T12 – Designing and Optimizing Large-Scale Multimedia Mining Applications in Distributed Processing Environments

Duration: half day

T13 – Recent Advances in Multimedia Signal Processing for Conferencing Applications

Cha Zhang, Zhengyou Zhang
Abstract: Multimedia signal processing is the building block for many multimedia systems and applications. In this tutorial, we aim to provide an overview of some recent advances in multimedia signal processing, in particular for conferencing applications. Key topics covered by this tutorial include sound source localization from compact microphone arrays, 3D spatial sound and multi-channel echo cancellation, various real-time video processing techniques for enhancing conferencing experiences, and a few explorations on the adoption of the soon-to-be-commodity depth sensors in conferencing applications. We hope the techniques presented in this tutorial can sharpen the tools multimedia researchers use, and help build better multimedia systems in the future.
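As a small illustration of sound source localization with a compact microphone array, one of the topics listed above, the simplest building block is estimating the time-difference of arrival (TDOA) between two microphones from a cross-correlation peak. The impulse-like signals below are contrived for illustration, not real recordings:

```python
# Estimate the lag (in samples) by which signal b trails signal a, via the
# peak of their cross-correlation. A positive lag means the source's sound
# reached microphone a first.

def estimate_delay(a, b):
    """Lag in samples maximizing sum_i a[i] * b[i + lag]."""
    n = len(a)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-(n - 1), n):
        score = sum(a[i] * b[i + lag]
                    for i in range(max(0, -lag), min(n, n - lag)))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# Microphone 2 hears the same click 3 samples later than microphone 1:
mic1 = [0.0] * 16
mic1[5] = 1.0
mic2 = [0.0] * 16
mic2[8] = 1.0
print(estimate_delay(mic1, mic2))  # → 3
```

Given a sampling rate and the microphone spacing, the lag converts to a bearing angle; practical conferencing systems use generalized cross-correlation variants (e.g. GCC-PHAT) for robustness to room reverberation.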

Details: T13 – Recent Advances in Multimedia Signal Processing for Conferencing Applications

Duration: half day

T14 – Emerging Network Science and Multimedia Research

Ching-Yung Lin
Abstract: Network science is emerging as a new scientific discipline. Researchers from multiple disciplines (electrical engineering, computer science, sociology, public health, economics, management, politics, law, the arts, physics, mathematics, etc.) are starting to collaborate with each other to build the common ground of network science. Entities (people, information, societies, nations, devices) connect to each other and form all kinds of intertwined networks, e.g., social and cognitive networks, information networks, and communication networks. Research on these networks resides in different disciplines, yet researchers have seldom been familiar with progress outside their own fields. For instance, research on social network analysis dating back to the 1920s already included insightful relationship analysis and multiresolution network graph analysis; it was only in the mid-2000s that computer scientists started noticing this field, especially as popular social networking sites started reaching most people’s lives 3 years ago. Thus, for lack of knowledge of previous research, many technologies are being reinvented in the latest literature. Network theory is still in an embryonic stage: as such, there is a continuing effort in the research communities to model, analyze, and understand the interactions among networks. For instance, a common mathematical language appropriate for describing the dynamics, behaviors, and structures of networks, or a systematic mathematical formalism that enables predictions of network behavior and network interactions, does not currently exist but is emerging. The collective knowledge of network researchers in different disciplines can jointly advance this new scientific direction and thus also make multidisciplinary, integrated system consideration possible. A trans-disciplinary approach is required to lay the foundations of this science. A new 5-to-10 year network science consortium of about 100.

Details: T14 – Emerging Network Science and Multimedia Research

Duration: half day
