Towards transforming the Indian language technology ecosystem

The Kotak IISc AI-ML Centre launched its Distinguished Seminar Series with a talk by Mitesh M Khapra, Associate Professor in the Department of Computer Science and Engineering and Head of the AI4Bharat Research Lab at the Indian Institute of Technology Madras, Chennai. The talk was hosted in the A V Rama Rao auditorium, Chemical Sciences building, Indian Institute of Science, Bengaluru on 20 September 2024.

The mission of the AI4Bharat Lab is to “bring parity with respect to English in AI technologies for Indian languages with open-source contributions in datasets, models, and applications”. In his talk, Khapra highlighted the need to focus on language technology in India. As per the 2011 Census of India, there is only a 36% chance that two Indians selected at random share a language in which they can even talk to each other. Speech and language technologies are therefore essential to bridge the language and digital divide in India. Khapra pointed out that people cannot be asked to choose between ‘going digital’ and ‘staying regional’.

Current large language models (LLMs) predominantly serve the English language. The AI4Bharat team focusses on harnessing the power of LLMs for regional languages. In 2021, the status of Indian languages in machine translation, language understanding, speech recognition, and speech synthesis was far behind that of other languages. Good language technology is not available for Indian languages because of a lack of data and of data democratisation; the latter is essential so that everyone can use whatever data is available.

The cost and time required to create a corpus that can be used to train speech recognition models are very high. For the 22 constitutionally-recognised Indian languages, with around one million sentences per language, around 15 words per sentence, and a cost of Rs 3 per word, the total cost would be approximately Rs 99 crore. Is there another way? A recipe that has been tried and tested for English and other languages is to take a large amount of (noisy) data from the web, add a small amount of manually-created (clean) data, and train large-scale (multilingual) models.
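The Rs 99 crore figure works out as a simple back-of-the-envelope calculation; the sketch below reproduces it using the talk's stated assumptions (22 languages, ~1 million sentences each, ~15 words per sentence, Rs 3 per word):

```python
# Back-of-the-envelope estimate of the manual corpus creation cost.
# All inputs are the assumptions stated in the talk, not measured values.

NUM_LANGUAGES = 22            # constitutionally-recognised Indian languages
SENTENCES_PER_LANGUAGE = 1_000_000
WORDS_PER_SENTENCE = 15
COST_PER_WORD_RS = 3

total_words = NUM_LANGUAGES * SENTENCES_PER_LANGUAGE * WORDS_PER_SENTENCE
total_cost_rs = total_words * COST_PER_WORD_RS
cost_in_crores = total_cost_rs / 10_000_000   # 1 crore = 10^7

print(f"Total words: {total_words:,}")         # 330,000,000
print(f"Total cost: Rs {cost_in_crores:.0f} crore")  # Rs 99 crore
```

At roughly Rs 99 crore for manual annotation alone, the web-scale "noisy data plus a little clean data" recipe described above becomes the economically viable alternative.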

Can this be done for Indian languages? Khapra’s team decided to go heavy on engineering so that they could replicate what had been done in DeepTech. For this, robust data pipelines were built, workflows were managed across distributed teams, and training runs were optimised on multi-node GPU (graphics processing unit) clusters. Their manual data collection process ensures standardisation, centralisation, mobilisation for diversity, maximum engagement, and quality.

None of the previously-existing benchmarks were source-original; that is, the sentences used in these benchmarks were translations of English sentences. The AI4Bharat team has created source-original benchmarks, with Hindi as the original language. Their model is the first to support all 22 Indian languages. How has this translated into on-the-ground solutions? Khapra gave several examples.

Some of the team’s latest contributions to Indian language technology include:

  1. IndicASR: The first speech recognition model supporting all 22 Indian languages.
    [https://ai4bharat.iitm.ac.in/areas/model/ASR/IndicConformer]
  2. Anudesh: An open-source data annotation platform aimed at advancing LLMs for Indian languages. Anudesh v1 supports workflows for conversation collection by interacting with LLMs and model evaluation.
    [https://ai4bharat.iitm.ac.in/tools/Anudesh]
  3. Rasa: An expressive text-to-speech (TTS) dataset covering nine languages and 14 speakers.
    [https://rasa.ai4bharat.org/]

IndicVoices-R, the largest multilingual Indian TTS dataset derived from an ASR (automatic speech recognition) dataset, was released in September 2024. Another model is to be released soon.