
T02 – Advances in Multimedia Retrieval

Multimedia that cannot be found is, in a certain sense, useless. It is lost in a huge collection, or worse, in a back alley of the Internet, never viewed and impossible to reuse. Research in multimedia retrieval is directed at developing techniques that bring video together with users – matching multimedia content and user needs. The aim of this tutorial is to provide insights into the most recent developments in the field of multimedia retrieval and to identify the issues and bottlenecks that could determine the directions of research focus for the coming years.

This tutorial targets new scientists in the field of multimedia retrieval, providing instruction on how to best approach the multimedia retrieval problem and examples of promising research directions to work on. It is also designed to benefit active multimedia retrieval scientists — those who are searching for new challenges or re-orientation. The material covered is relevant for participants from both academia and industry. It covers issues pertaining to the development of modern multimedia retrieval systems and highlights emerging challenges and techniques anticipated to be important for the future of multimedia retrieval.

This tutorial consists of two parts, which can be attended separately, but together provide a complete overview of recent advances in multimedia retrieval. The first part (morning session) titled Frontiers in Multimedia Search will be presented by Alan Hanjalic and Martha Larson from the Delft University of Technology, The Netherlands, and the second part (afternoon session) titled Video Search Engines will be presented by Cees Snoek and Arnold Smeulders from the University of Amsterdam, The Netherlands. The focus and program of each part are outlined below.

Part I: Frontiers in Multimedia Search

In this part of the tutorial, we present a picture of the research efforts at the frontiers of multimedia search, encouraging participants to direct their own research work more effectively, to abandon outdated assumptions about the nature of the field, and to keep the focus on the ever-evolving needs of the user.

We will start by outlining the critical issues concerning usefulness of multimedia search systems, addressing the questions, “How does multimedia search fit into our lives?” and “What do users really want from multimedia search systems?” The issues discussed include:

  • pictorial vs. thematic relevance (“aboutness”) of a search result,
  • relevance vs. diversity of top-ranked search results,
  • simplicity vs. transparency: how much of the search process should the user see?

The main body of the presentation focuses on possibilities for exploiting and combining available information resources to optimize multimedia search results in view of these usefulness issues. We concentrate on three complementary information sources:

  • User: Exploiting the interaction of the user with the search system, either to enhance the query so that it better reflects the user information need and search intent, or to enrich the collection with implicit or explicit metadata. Approaches discussed include: transaction log analysis, context modeling in multimedia search, (visual) query suggestion and user-supported query expansion.
  • Collection: Exploiting the information inherent in the relationships that exist in the collection and in the search environment, for example, similarities between documents and connections among users. Two categories of approaches and techniques working in this direction will be discussed:
    • Maximizing the quality of the top-ranked search results using IR concepts and cross-modal analysis through e.g., (visual) search reranking, query-class-dependent search and query performance prediction,
    • Integrating social information from networked communities, including use of community-contributed metadata and techniques for socially-driven tag propagation and de-noising.
  • Content: Exploiting all information channels (both individually and combined) in the content collection itself. Automatic indexing systems (e.g., speech recognition, audio event detection, semantic concept detection) are well known for their imperfections. Rather than resisting this noise indefinitely, multimedia search paradigms are required that can deal with it robustly. Approaches discussed include:
    • Lessons learned from spoken content retrieval (feature engineering, confidence scores, lattices),
    • Exploiting characteristics of multimedia items that are revealed using simple methods of structural analysis,
    • Integrating information from external sources to reduce influence of indexing noise.
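The socially-driven tag propagation mentioned above can be illustrated with a minimal neighbor-voting sketch. All item names, tags, and the voting threshold below are hypothetical; real systems determine neighbors by visual similarity over large collections and apply more sophisticated de-noising:

```python
from collections import Counter

def propagate_tags(item_tags, neighbors, min_votes=2):
    """Neighbor-voting tag propagation: an untagged item inherits a tag
    when at least `min_votes` of its visual neighbors carry it."""
    votes = Counter()
    for neighbor in neighbors:
        votes.update(item_tags.get(neighbor, set()))
    return {tag for tag, count in votes.items() if count >= min_votes}

# Hypothetical community-contributed tags of an item's visual neighbors.
tags = {
    "photo_a": {"beach", "sunset"},
    "photo_b": {"beach", "holiday"},
    "photo_c": {"beach", "sunset", "sea"},
}

# Tags receiving at least 2 neighbor votes are propagated; idiosyncratic
# tags ("holiday", "sea") are filtered out as noise.
print(propagate_tags(tags, ["photo_a", "photo_b", "photo_c"]))
# → {'beach', 'sunset'} (in some order)
```

The same voting idea also de-noises existing annotations: a tag that none of an item's visual neighbors share is a candidate for removal.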

The tutorial closes with an examination of the opportunities to formulate research topics that are closely related to the needs of the user and to carry out work in the newly evolving multimedia search paradigms. In particular, the MediaEval (formerly VideoCLEF) benchmarking effort will be presented, which emphasizes innovative multimedia search tasks, such as semantic-theme tagging for video. For the semantic-theme task, depiction of specific entities, scenes or events in the video is not sufficient; rather, the “aboutness” of a video, i.e., its subject matter taken as a whole, determines relevance to a particular tag. An overview of the tasks and an explanation of the evaluation corpus and criteria will be provided.

Organizers/Presenters

Dr. Alan Hanjalic is an Associate Professor and Coordinator of the Delft Multimedia Information Retrieval Lab at the Delft University of Technology, The Netherlands. He was a visiting scientist at Hewlett-Packard Labs, British Telecom Labs, Philips Research Labs and Microsoft Research Asia. Dr. Hanjalic's research interests and expertise are in the broad areas of multimedia computing, with a focus on multimedia information retrieval and personalized multimedia content delivery. In his areas of expertise, Dr. Hanjalic (co-)authored more than 80 publications, among which the books titled Image and Video Databases: Restoration, Watermarking and Retrieval (Elsevier, 2000) and Content-Based Analysis of Digital Video (Kluwer Academic Publishers, 2004). Dr. Hanjalic has been on the Editorial Boards of the IEEE Transactions on Multimedia (2006-2010), the IEEE Transactions on Affective Computing, the Journal of Multimedia, Advances in Multimedia (Hindawi) and the Image and Vision Computing journal (Elsevier). He was also a Guest Editor of special issues in a number of journals, including the Proceedings of the IEEE (2008), IEEE Transactions on Multimedia (2009), and the Journal of Visual Communication and Image Representation (2009). He has also served on the organizing committees of major conferences in the multimedia field, including the ACM Multimedia Conference (General Chair 2009, Program Chair 2007), the ACM CIVR conference (Program Chair 2008), the ACM ICMR conference (Program Chair 2011), the WWW conference (Track Chair 2008), the Multimedia Modeling Conference (Area Chair 2007), the Pacific Rim Conference on Multimedia (Track Chair 2007), IEEE ICME (Track Chair 2007), and the IEEE ICIP conference (Track Chair 2010). Dr. Hanjalic was a Keynote Speaker at the Pacific-Rim Conference on Multimedia, Hong Kong, December 2007, and a member or organizer of panels at the ACM Multimedia 2007 and ACM MIR 2010 conferences and the Picture Coding Symposium 2007.

Dr. Martha Larson is a senior researcher in the area of video retrieval and speech search in the Multimedia Information Retrieval Lab, Delft University of Technology, The Netherlands. Before joining Delft University of Technology, she researched and lectured in the area of audio-visual retrieval at Fraunhofer IAIS and at the University of Amsterdam. Martha Larson holds an MA and PhD in theoretical linguistics from Cornell University and a BS in Mathematics from the University of Wisconsin. She carries out her research within PetaMedia, an EU Network of Excellence dedicated to improving multimedia access technology by integrating multimedia content analysis and user-based metadata sources, including tags and social network structures. Martha Larson is a co-organizer of MediaEval, a multimedia retrieval benchmark campaign that emphasizes multimodal and especially speech-based multimedia search approaches. She is an organizer of the “Searching Spontaneous Conversational Speech” workshop series, organized since 2007 in conjunction with the ACM SIGIR and ACM Multimedia conferences. She is a guest editor of the upcoming ACM TOIS special issue on searching spontaneous conversational speech. Recently, much of her research has focused on deriving and exploiting information from multimedia that is “orthogonal to topic”, i.e., not directly related to the subject matter; examples include information on user-perceived quality and affect. Such information can be used to improve the quality of multimedia search. Her research interests also include user-generated multimedia content, cultural heritage archives, indexing approaches exploiting multiple modalities, techniques for semantic structuring of spoken content, and methods for reducing the impact of speech recognition errors on speech-based retrieval. She has participated as both researcher and research coordinator in a number of projects, including the EU projects MultiMatch and SHARE.

Part II: Video Search Engines

In this part of the tutorial we focus on the challenges in video search, present methods to achieve state-of-the-art performance, and indicate how to obtain improvements in the near future. Moreover, we give an overview of the latest developments and future trends in the field on the basis of the TRECVID competition – the leading competition for video search engines, run by NIST – in which we have achieved consistent top performance over the years, including the 2008 and 2009 editions.

The scientific topic of video search is dominated by five major challenges:

  • the sensory gap between an object and its many appearances due to the accidental sensing conditions;
  • the semantic gap between a visual concept and its lingual representation;
  • the model gap between the number of notions in the world and the capacity to learn them;
  • the query-context gap between the information need and the possible retrieval solutions;
  • the interface gap between the tiny window the screen offers and the vast amount of data behind it.

The semantic gap is bridged by forming a dictionary of visual concept detectors. The largest dictionaries to date consist of hundreds of concepts, yet hand-crafting a concept-tailored algorithm for each would simply take too long. Instead, we come closer to the ideal of one computer vision algorithm, tailored automatically to the purpose at hand, by learning from example data. We discuss the advantages and limitations of a machine learning approach from examples, and show for which types of concept the approach is likely to succeed or fail. In compensation for the absence of concept-specific (geometric or appearance) models, we emphasize the importance of good feature sets: they form the basis of the observational model, in which color, shape, texture, and structure invariant features help to characterize the concept at hand. Apart from good features, the other essential component is state-of-the-art machine learning, in order to get the most out of the learning data.
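The learning-from-examples idea can be sketched with a toy detector. Everything here is hypothetical and drastically simplified: the 2-D feature vectors stand in for real invariant features, and the nearest-centroid rule stands in for the kernel-based classifiers used in practice:

```python
def centroid(vectors):
    """Mean feature vector of a set of example shots."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def train_detector(positives, negatives):
    """Learn a concept detector from labeled example feature vectors."""
    return centroid(positives), centroid(negatives)

def score(detector, features):
    """Positive score = closer to the positive centroid than the negative one."""
    pos, neg = detector
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return dist(features, neg) - dist(features, pos)

# Hypothetical 2-D invariant features for shots labeled "beach" vs. other.
beach = [[0.9, 0.1], [0.8, 0.2]]
other = [[0.1, 0.9], [0.2, 0.8]]
det = train_detector(beach, other)

# A new shot whose features resemble the positive examples scores above zero.
print(score(det, [0.85, 0.15]) > 0)  # → True
```

The point of the sketch is the workflow, not the classifier: given labeled examples in a suitable feature space, the same training code serves any concept, which is what makes a dictionary of hundreds of detectors feasible.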

We integrate the features and machine learning aspects into a complete concept-based video search engine, which has successfully competed in TRECVID. The multimedia system includes computer vision, machine learning, information retrieval, and human-computer interaction. We follow the video data as they flow through the computational processes, starting from fundamental visual features covering local shape, texture, color, motion, and the crucial need for invariance. Then we explain how invariant features can be used in concert with kernel-based supervised learning methods to arrive at a concept detector. We discuss the important role of fusion on the feature, classifier, and semantic levels to improve the robustness and general applicability of detectors. We end our component-wise decomposition of video search engines by explaining the complexities involved in delivering a limited set of uncertain concept detectors to an impatient user. For each of the components we review state-of-the-art solutions in the literature, each having different characteristics and merits.
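Classifier-level fusion can be illustrated with a minimal late-fusion sketch. The detector names and scores below are hypothetical, and the equal weighting is an assumption; real systems normalize scores and often learn the fusion weights:

```python
def late_fusion(score_lists, weights=None):
    """Weighted average of per-shot scores from several concept detectors.

    Each entry of `score_lists` holds one detector's scores for the same
    sequence of shots; equal weights are used unless given explicitly.
    """
    n = len(score_lists)
    weights = weights or [1.0 / n] * n
    return [sum(w * scores[i] for w, scores in zip(weights, score_lists))
            for i in range(len(score_lists[0]))]

# Hypothetical scores for three shots from a color-based and a
# texture-based detector trained for the same concept.
color_scores = [0.9, 0.2, 0.6]
texture_scores = [0.7, 0.4, 0.8]

fused = late_fusion([color_scores, texture_scores])
print([round(s, 2) for s in fused])  # → [0.8, 0.3, 0.7]
```

Averaging tempers the individual detectors' errors: a shot only ranks high when several complementary feature channels agree, which is what makes fused detectors more robust and more generally applicable.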

Comparative evaluation of methods and systems is imperative to appreciate progress. We discuss the data, tasks, and results of TRECVID, the leading benchmark. In addition, we discuss the many derived community initiatives in creating annotations, baselines, and software for repeatable experiments. We conclude the course with our perspective on the many challenges and opportunities ahead for the multimedia retrieval community.

Organizers/Presenters

Dr. Cees G.M. Snoek received the M.Sc. degree in business information systems (2000) and the Ph.D. degree in computer science (2005), both from the University of Amsterdam, The Netherlands, where he is currently a senior researcher at the Intelligent Systems Lab Amsterdam. He was a Visiting Scientist at Carnegie Mellon University, USA in 2003 and at UC Berkeley, USA in 2010/2011. His research interests focus on multimedia signal processing and analysis, statistical pattern recognition, content-based information retrieval, social media retrieval, and large-scale benchmark evaluations, especially when applied in combination for video retrieval. He has published over 80 refereed book chapters, journal and conference papers in these fields, and serves on the program committees of several conferences. Dr. Snoek is a lead researcher of the award-winning MediaMill Semantic Video Search Engine, which is a consistent top performer in the yearly NIST TRECVID evaluations. He is initiator and co-organizer of the annual VideOlympics, co-chair of the Multimedia Grand Challenge at ACM Multimedia 2010, and program co-chair of the International Workshop on Content-Based Multimedia Indexing 2010. He is a lecturer of post-doctoral courses given at international summer schools and conferences such as CVPR and ICCV. He is a member of ACM and IEEE. Dr. Snoek received a young talent (VENI) grant from the Netherlands Organization for Scientific Research in 2008 and a Fulbright Scholarship in 2010.

Prof. dr. ir. Arnold W.M. Smeulders graduated from the Technical University of Delft in physics in 1977 (M.Sc.) and in 1982 from Leyden University in medicine (Ph.D.) on the topic of visual pattern analysis. In 1994, he became full professor in multimedia information analysis at the University of Amsterdam. He has an interest in cognitive vision, content-based image retrieval, the picture-language question, as well as in systems for the analysis of video. He has written over 250 papers in refereed journals and conferences. He received a Fulbright grant at Yale University in 1987, and a visiting professorship at the City University Hong Kong in 1996, and again at Tsukuba, Japan in 1998. In 2000, he was elected fellow of the International Association for Pattern Recognition. He was associate editor of the IEEE Transactions on PAMI. Currently he is associate editor of the International Journal of Computer Vision as well as the IEEE Transactions on Multimedia. He is a member of the steering committee of the IEEE International Conference on Multimedia and Expo series. He participates in the DELOS and MUSCLE networks of excellence of the EU. He was keynote speaker and chairman of the program committee of conferences including the IEEE Multimedia conference in Florence in 1999, ICIP 2000, CVPR in 2001, and CIVR in 2004 in Dublin. He was general chair of ICME 2005 in Amsterdam. In 1996, he was treasurer of the Faculty and director of the Informatics Institute at the University of Amsterdam. Currently, he is scientific director of the Intelligent Systems Lab Amsterdam of 65 staff members, of the MultimediaN national public-private partnership of 30 institutions and companies, and of the national research school ASCI. He has graduated 32 PhD students.
