Decoding Movies: Holistic Understanding of Situations and Characters

Makarand Tapaswi, Assistant Professor, Center for Visual Information Technology (CVIT), International Institute of Information Technology (IIIT) Hyderabad

– 11 June 2024

Talk summary: Despite tremendous advances in LLMs (large language models) and VLMs (vision–language models), fine-grained visual understanding remains elusive. With the additional complexity of time, even short video clips are easily misunderstood. In this talk, Tapaswi described his research group’s efforts on holistic video understanding. Starting with situation recognition, which answers “who is doing what to whom, where, and how”, he introduced a new approach for weakly supervised spatio-temporal grounding of such concepts in video. Next, he showed how dense captions derived from video clips can be used to efficiently and effectively adapt vision–language models such as CLIP. Moving beyond structured outputs towards coherent video descriptions, he argued that characters are an important prerequisite for long-video understanding, and explored enhancing existing captions through fill-in-the-blanks as well as generating identity-aware captions. Finally, he gave a quick overview of related work on predicting movie character emotions, generating television episode summaries, improving image captioning systems, and a new benchmark for evaluating VLMs.

Speaker bio: Makarand Tapaswi is an Assistant Professor at CVIT, IIIT Hyderabad. His research focuses on machine understanding of videos, language, and human behaviour, particularly on analysing storylines in movies and television series. Before joining IIIT, Makarand led and contributed to major projects such as HowTo100M, MovieQA, MovieGraphs, and clustering and identifying characters in videos. Additionally, Makarand is a Senior Machine Learning Scientist at Wadhwani AI, a non-profit applied-AI institute, focusing on applications of AI in healthcare and education.

https://makarandtapaswi.github.io/

[Talk organised in collaboration with the Department of Computational and Data Sciences]