Towards Reliable LLM Reasoning: Coordinated Agents, Variance-aware Evaluation, and Lean Inference

Speaker: Akhil Arora, Assistant Professor, Aarhus University

Date: 25 February 2026

YouTube link: https://youtu.be/0PSjerr4MKo

The talk on ‘Towards Reliable LLM Reasoning: Coordinated Agents, Variance-aware Evaluation, and Lean Inference’ by Akhil Arora was organised jointly with the Department of Computational and Data Sciences (CDS) at IISc. The attendees came from several departments across IISc, including the Department of Aerospace Engineering (AE), Translational AI for Networked Universal Healthcare (TANUH), the Department of Electrical Engineering (EE), and the Department of Electronic Systems Engineering (ESE). A summary of the talk is provided below.

Large language models (LLMs) are increasingly deployed as reasoning engines, yet their practical use remains constrained by three persistent challenges: achieving high-quality reasoning at low cost, measuring performance reliably, and ensuring efficient, reproducible deployment. In this talk, Akhil Arora presented a research agenda addressing these challenges through new methods, benchmarks, and systems for practical LLM reasoning.

He began with coordination as a pathway to efficiency. Fleet of Agents (FoA) introduces a framework where swarms of lightweight LLM agents explore search spaces in parallel and are resampled through a genetic-style process. This design shows that orchestration can often matter more than sheer size, enabling smaller models to outperform larger ones while achieving superior cost-quality trade-offs across diverse reasoning tasks.
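The core loop of this idea can be sketched in a few lines. The following is a toy illustration only, not FoA's actual implementation: `expand` and `score` are hypothetical stand-ins for an LLM agent's single reasoning step and its value estimate, and the task (matching a digit sequence) is invented for the sketch.

```python
import random

def expand(state, rng):
    # One "reasoning step": extend a partial answer with a random digit.
    # In FoA this would be one lightweight LLM agent taking a step.
    return state + [rng.randint(0, 9)]

def score(state, target):
    # Toy fitness: number of positions matching the target sequence.
    return sum(a == b for a, b in zip(state, target))

def fleet_search(target, n_agents=20, steps=6, seed=0):
    rng = random.Random(seed)
    fleet = [[] for _ in range(n_agents)]          # all agents start empty
    for _ in range(steps):
        fleet = [expand(s, rng) for s in fleet]    # parallel exploration
        weights = [1 + score(s, target) for s in fleet]
        # Genetic-style resampling: promising states are duplicated,
        # weak ones die out, while the fleet size stays constant.
        fleet = rng.choices(fleet, weights=weights, k=n_agents)
    return max(fleet, key=lambda s: score(s, target))

best = fleet_search(target=[3, 1, 4, 1, 5, 9])
```

The point of the sketch is that quality comes from how the fleet is steered, not from any single agent being strong: each agent is cheap, and the resampling step concentrates compute on promising branches.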

Next, he focussed on evaluation as a foundation for trust. ReasonBench exposes the limits of single-run reporting by systematically quantifying the run-to-run variability of LLM reasoning. Through variance-aware metrics, it reveals the hidden instability and cost unpredictability of many reasoning strategies, highlighting reproducibility as a first-class requirement for reliable reasoning.
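The underlying measurement idea is simple to illustrate: run the same stochastic strategy several times and report a spread rather than a single number. The sketch below is illustrative and does not reflect ReasonBench's actual metrics; `noisy_strategy` is a made-up stand-in for one run of a reasoning method.

```python
import random
import statistics

def variance_aware_report(run_strategy, n_runs=10):
    # Repeat the strategy under different seeds and summarise the spread.
    scores = [run_strategy(seed) for seed in range(n_runs)]
    return {
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }

def noisy_strategy(seed):
    # Stand-in for a stochastic reasoning run: accuracy ~0.7 with noise.
    return 0.7 + random.Random(seed).uniform(-0.1, 0.1)

report = variance_aware_report(noisy_strategy)
```

Two strategies whose min–max ranges overlap cannot be reliably ranked from one run each, which is exactly the single-run-reporting failure mode the talk highlighted.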

Finally, he focussed on systems as enablers of sustainable deployment. He presented CacheSaver, the first modular client-side framework for high-level inference optimisation. By introducing a namespace-aware caching mechanism, CacheSaver reduces cost and carbon emissions while preserving statistical integrity, making large-scale experimentation and deployment more affordable and sustainable without compromising reproducibility. Together, these contributions chart a path toward LLM reasoning that is not only more powerful, but also leaner, more reliable, and environmentally responsible.
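The caching idea can be sketched as follows. This is a minimal illustration of namespace-keyed caching in general, not CacheSaver's actual API: identical prompts inside one namespace reuse a cached response, while separate namespaces (e.g. independent experiment runs) trigger fresh model calls, so repeated runs stay statistically independent.

```python
class NamespaceCache:
    """Toy namespace-aware cache (illustrative; names are hypothetical)."""

    def __init__(self, model_call):
        self._model_call = model_call   # the expensive LLM call to avoid
        self._store = {}                # (namespace, prompt) -> response
        self.hits = 0
        self.misses = 0

    def query(self, namespace, prompt):
        # Keying on (namespace, prompt) rather than prompt alone is what
        # preserves statistical integrity across independent experiments.
        key = (namespace, prompt)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._model_call(prompt)
        else:
            self.hits += 1
        return self._store[key]

calls = []
def fake_model(prompt):
    # Stand-in for a paid API call; each invocation is distinct.
    calls.append(prompt)
    return f"answer to {prompt!r} (call #{len(calls)})"

cache = NamespaceCache(fake_model)
a1 = cache.query("run-1", "2+2?")
a2 = cache.query("run-1", "2+2?")   # hit: same namespace, no new call
a3 = cache.query("run-2", "2+2?")   # miss: fresh sample for a new run
```

Only two model calls are made for three queries; the saved call is where the cost and carbon reduction comes from, while the per-namespace miss keeps runs independent.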