• AI4Bharat is collecting 10 trillion tokens of language data to build AI models tailored for Indian languages, covering 22 official languages.
• The initiative, alongside People+ai, aims to address the scarcity of vernacular datasets for AI training, supporting applications in various sectors.

IIT Madras-incubated AI lab AI4Bharat is collecting 10 trillion tokens of language data to develop advanced AI services tailored for Indian languages. The effort aims to improve language models by gathering extensive linguistic data from across the country, covering diverse demographics and professions.

Tokens, the basic units of text that large language models (LLMs) process, can be words, characters, or subwords. AI4Bharat co-founder Mitesh Khapra stated that the team has “gone to almost every district in the country” and “tried to cover almost all the 22 official languages” over the past three years. The data collection process involves gathering voice samples and other linguistic inputs from various sources.
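
To illustrate the three kinds of tokens mentioned above, here is a minimal Python sketch. It is not AI4Bharat's tokenizer; the function names and the tiny vocabulary are hypothetical, and the subword split uses a simple greedy longest-match rule chosen only for illustration.

```python
def word_tokens(text):
    """Split text into word-level tokens on whitespace."""
    return text.split()


def char_tokens(text):
    """Split text into character-level tokens, skipping whitespace."""
    return [ch for ch in text if not ch.isspace()]


def subword_tokens(word, vocab):
    """Split one word into subwords by greedy longest match.

    `vocab` is a toy, hypothetical vocabulary. Any piece that is not
    in the vocabulary falls back to a single character.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is in the vocabulary
        # or only one character is left.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        tokens.append(word[start:end])
        start = end
    return tokens


if __name__ == "__main__":
    toy_vocab = {"lang", "uage", "model", "s"}  # assumed for illustration
    sentence = "language models"
    print(word_tokens(sentence))
    print(char_tokens(sentence))
    print([subword_tokens(w, toy_vocab) for w in word_tokens(sentence)])
```

The same sentence yields 2 word tokens, 14 character tokens, or 4 subword tokens under this toy vocabulary, which is why token counts depend on the tokenization scheme used.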

Khapra emphasized that AI4Bharat has built its own tools for data collection, and several startups, academic institutions, and deep-tech firms are leveraging this data to develop their own AI models.

“Our data, models and scripts are open-sourced. You can build on top of that,” he said.

The collected data will contribute to the “Ten Trillion Token” project, which focuses on creating native AI models for Indian languages.

“This is going to be required to make sure that we are able to build native Indic models that support Indian languages and not as an afterthought. We want to collect 10 Tn tokens in Indian languages that would be synthetic data that would be language information and cultural information,” he added.

The project is expected to have applications across multiple sectors, including agriculture, education, digital payments, and rural communication. The initiative aims to address a major challenge in AI development: the limited availability of vernacular language datasets. While English accounts for nearly 55% of online content, Indian languages lack sufficient resources for training AI models.

AI4Bharat’s work aligns with a similar initiative by People+ai, an organization backed by Aadhaar architect Nandan Nilekani. People+ai is also collecting 10 trillion language tokens, primarily from government documents and conversational data, to create datasets essential for training AI foundation models.

Both projects focus on building datasets from the ground up to accurately capture linguistic, grammatical, and cultural nuances in Indian languages.

Khapra’s remarks come a year after AI4Bharat introduced IndicVoices, an open-source speech dataset funded by the Ministry of Electronics and IT’s Bhashini initiative and other non-profits. The dataset spans 22 Indian languages and is intended to support AI-driven speech technologies.


Edited by Harshajit Sarmah