Capabilities of O3SLM

Large vision–language models (LVLMs) have transformed how AI understands the world, but they possess a significant ‘blind spot’: abstract visual inputs like hand-drawn sketches. While text is the primary way we interact with these models, it often fails to describe complex shapes or precise spatial arrangements that a simple sketch can convey instantly. Current state-of-the-art models struggle with sketches because sketches are highly variable, abstract, and fundamentally different from the photorealistic images on which these models are typically trained.

A research team from the Visual Computing Lab (Department of Computational and Data Sciences) at the Indian Institute of Science, comprising Rishi Gupta, Mukilan Karuppasamy, Shyam Marjit, Dr Aditay Tripathi, and Prof Anirban Chakraborty, has addressed this challenge with O3SLM (Open Weight, Open Data, and Open Vocabulary Sketch-Language Model). Presented at the AAAI Conference on Artificial Intelligence (AAAI 2026), O3SLM is the first unified model designed to reason with sketches, photos, and text simultaneously, and it aligns sketches with images and language in settings where previous models consistently failed.

Recognising the lack of open-source datasets for this task, the team introduced SketchVCL, a massive multi-task dataset. Since manually drawing enough sketches for training would be impossible, the team developed an automated generation pipeline and used it to curate over 30 million sketch instances.

O3SLM introduces four capabilities for sketch-based interaction:

  • Object localisation: The model can find and draw precise bounding boxes around specific objects in a photo based on a hand-drawn sketch query.
  • Sketch-based counting: It can accurately count how many instances of a sketched object appear in a complex scene.
  • Sketch-based image retrieval (SBIR): Users can find specific images in a large gallery simply by providing a sketch.
  • Visual question answering (VQA): The model can answer nuanced questions about a photo by referring to specific objects through sketches, such as describing an object’s colour, purpose, or relationship to its surroundings.
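
To make the retrieval capability concrete, here is a minimal sketch of how sketch-based image retrieval is commonly framed: the sketch and every gallery image are mapped to vectors, and the gallery is ranked by similarity to the sketch. This is a generic illustration, not O3SLM's actual retrieval procedure, which the article does not describe. The function names, the three-dimensional toy vectors, and the file names are all hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_gallery(
    sketch_vec: Sequence[float],
    gallery: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k gallery images most similar to the sketch."""
    scored = [
        (name, cosine_similarity(sketch_vec, vec))
        for name, vec in gallery.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical embeddings: a real system would obtain these from a model.
gallery = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "bicycle.jpg": [0.1, 0.9, 0.2],
    "teapot.jpg": [0.0, 0.2, 0.9],
}
sketch_query = [0.2, 0.8, 0.1]  # stands in for an embedded bicycle sketch

results = rank_gallery(sketch_query, gallery, top_k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

In this toy gallery the bicycle photo ranks first because its vector points in nearly the same direction as the sketch query. The hard part in practice, and the one the article attributes to O3SLM's training, is producing representations in which a crude sketch and a photorealistic image of the same object end up close together.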

The team implemented a two-stage training curriculum to build these skills. First, a sketch alignment stage familiarises the model with crude sketches and their relationship to natural images. Second, instruction tuning aligns the model to follow specific task-based instructions.

In evaluations, O3SLM outperformed existing open-source models and even surpassed closed-source giants like GPT-4o and Gemini 1.5 Pro on sketch understanding tasks, a significant milestone for human–AI interaction.

Project page: https://vcl-iisc.github.io/O3SLM/

Paper: https://arxiv.org/pdf/2511.14368

Dataset: https://huggingface.co/datasets/anirban-iisc/SketchVCL