Teaching Multimodal AI to Know When It Might Be Wrong

Multimodal large language models can now answer questions about images, audio, and text. But in real-world settings, being accurate is not enough. A trustworthy AI system should also know when it is likely to be wrong and choose to abstain instead of giving a misleading answer.

This is especially important for multimodal AI systems, where errors can arise not only from language reasoning, but also from poor visual or audio grounding.

To address this, researchers at the LEAP Lab, Indian Institute of Science, developed FESTA (Functionally Equivalent Sampling for Trust Assessment), a black-box uncertainty estimation framework for multimodal LLMs.

Core idea

FESTA asks two simple but powerful questions:

  1. Consistency:
    If the input is changed without changing its meaning, does the model keep the same answer?
  2. Sensitivity:
    If the input is changed in a way that should change the answer, does the model actually change its response?

A reliable model should be both consistent and sensitive. If the model fails either test, FESTA assigns higher uncertainty.
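
As a rough illustration of the two checks, here is a minimal Python sketch. It is not the scoring rule from the paper: the function name festa_style_uncertainty, the toy keyword-based models, and the choice to simply add the two failure rates are all assumptions made for this example. It only needs the model's final answers, which matches the black-box setting.

```python
from typing import Callable, Dict, Sequence


def festa_style_uncertainty(
    model: Callable[[str], str],
    original: str,
    equivalent: Sequence[str],
    complementary: Sequence[str],
) -> Dict[str, float]:
    """Toy consistency/sensitivity score (hypothetical, not the paper's formula).

    equivalent:    inputs that keep the meaning, so the answer should stay the same.
    complementary: inputs that change the meaning, so the answer should change.
    """
    base = model(original)

    # Consistency check: fraction of equivalent inputs where the answer flipped.
    inconsistency = (
        sum(model(x) != base for x in equivalent) / len(equivalent)
        if equivalent else 0.0
    )
    # Sensitivity check: fraction of complementary inputs where the answer did NOT change.
    insensitivity = (
        sum(model(x) == base for x in complementary) / len(complementary)
        if complementary else 0.0
    )
    # Assumption: combine the two failure rates with a plain sum.
    return {
        "inconsistency": inconsistency,
        "insensitivity": insensitivity,
        "uncertainty": inconsistency + insensitivity,
    }


# Two toy stand-ins for a multimodal model (text-only, for illustration).
def grounded_model(question: str) -> str:
    return "yes" if "left" in question.lower() else "no"


def stubborn_model(question: str) -> str:
    return "yes"  # ignores the input entirely


question = "Is the cat to the left of the dog?"
equivalent_inputs = [
    "Is the cat on the left side of the dog?",
    "Does the cat sit to the left of the dog?",
]
complementary_inputs = [
    "Is the cat to the right of the dog?",
    "Is the cat above the dog?",
]

grounded_score = festa_style_uncertainty(
    grounded_model, question, equivalent_inputs, complementary_inputs
)
stubborn_score = festa_style_uncertainty(
    stubborn_model, question, equivalent_inputs, complementary_inputs
)
print(grounded_score)
print(stubborn_score)
```

In this toy setup, the model that ignores its input looks perfectly consistent, and only the sensitivity check exposes it. That is why both questions are asked together.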

Key findings

  • FESTA improves misprediction detection for both vision–language and audio–language models.
  • It is especially useful for detecting low-uncertainty hallucinations, that is, cases where the model is confidently wrong and standard entropy-based methods can fail.
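
To see why entropy alone can miss a confident error, consider the small Python sketch below. It assumes uncertainty is estimated as the Shannon entropy of answers sampled repeatedly for the same input; the helper answer_entropy and the sample answers are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from typing import Sequence


def answer_entropy(answers: Sequence[str]) -> float:
    """Shannon entropy (in nats) of the empirical distribution of sampled answers."""
    counts = Counter(answers)
    total = len(answers)
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


# Hypothetical samples for a multiple-choice question whose correct answer is "A".
confidently_wrong = ["B", "B", "B", "B", "B"]  # always the same wrong option
visibly_unsure = ["A", "B", "A", "C", "B"]     # answers vary between samples

entropy_wrong = answer_entropy(confidently_wrong)
entropy_unsure = answer_entropy(visibly_unsure)
print(entropy_wrong, entropy_unsure)
```

The confidently wrong samples have zero entropy, so an entropy threshold would treat that prediction as trustworthy. Checking how the answer behaves under meaning-preserving and meaning-changing inputs gives a signal that repeated sampling of the same input cannot.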

Paper details

Title: FESTA: Functionally Equivalent Sampling for Trust Assessment of Multimodal LLMs
Authors: Debarpan Bhattacharya, Apoorva Kulkarni, and Sriram Ganapathy
Published in: Findings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China

Paper link: https://aclanthology.org/anthology-files/anthology-files/pdf/findings/2025.findings-emnlp.657.pdf