Multimodal Generative LLMs: Unification, Interpretability, Evaluation

Mohit Bansal, John R & Louise S Parker Professor and Director of the MURGe-Lab (UNC-NLP Group), Department of Computer Science, UNC Chapel Hill

– 8 January 2024

Talk summary: In this talk, Mohit Bansal presented his team’s journey on large-scale multimodal pretrained (generative) models across various modalities (text, images, videos, audio, layouts), enhancing important aspects such as unification (for generalisability, shared knowledge, and efficiency), interpretable programming/planning (for controllability and faithfulness) and evaluation (of fine-grained skills, faithfulness, and social biases). He started by discussing early cross-modal vision-and-language pretraining models (LXMERT). He then looked at early unified models (VL-T5) that combine several multimodal tasks (such as visual QA, referring expression comprehension, visual entailment, visual commonsense reasoning, captioning, and multimodal translation) by treating all tasks as text generation. Next, he looked at more recent, progressively more unified models (with joint objectives and architecture, as well as newer unified modalities during encoding and decoding) such as textless video-audio transformers (TVLT), vision-text-layout transformers for universal document processing (UDOP), and composable any-to-any text-audio-image-video multimodal generation (CoDi).

He also discussed interpretable and controllable multimodal generation (to improve faithfulness) via large language model (LLM)-based planning and programming, such as layout-controllable image generation via visual programming (VPGen), consistent multi-scene video generation via LLM-guided planning (VideoDirectorGPT), and open-domain, open-platform diagram generation (DiagrammerGPT). He concluded with important faithfulness and bias evaluation aspects of multimodal generation models, based on fine-grained skill and social bias evaluation (DALL-Eval), interpretable and explainable visual programs (VPEval), as well as reliable fine-grained evaluation via Davidsonian Semantics (DSG).

Speaker bio: Mohit Bansal is the John R & Louise S Parker Professor and the Director of the MURGe-Lab (UNC-NLP Group) in the Computer Science department at UNC Chapel Hill. He received his PhD from UC Berkeley in 2013 and his BTech from IIT Kanpur in 2008. His research expertise is in natural language processing and multimodal machine learning, with a particular focus on multimodal generative models, grounded and embodied semantics, faithful language generation, and interpretable and generalisable deep learning. He is a recipient of the IIT Kanpur Young Alumnus Award, DARPA Director’s Fellowship, NSF CAREER Award, Google Focused Research Award, Microsoft Investigator Fellowship, Army Young Investigator Award (YIP), DARPA Young Faculty Award (YFA), and outstanding paper awards at ACL, CVPR, EACL, COLING, and CoNLL. He has been a keynote speaker for the AACL 2023, CoNLL 2023, and INLG 2022 conferences. His service includes the ACL Executive Committee, ACM Doctoral Dissertation Award Committee, CoNLL Program Co-Chair, ACL Americas Sponsorship Co-Chair, and Associate/Action Editor for TACL, CL, IEEE/ACM TASLP, and CSL journals.

[Talk organised in collaboration with the Department of Computational and Data Sciences]