Daily News Summary

Get concise and efficient summaries of key articles from prominent newspapers. Our daily news digest ensures quick reading and easy understanding, helping you stay informed about important events and developments without spending hours going through full articles. Perfect for focused and timely updates.

The Great India Data Hunt

Posted 16 Sep 2025

3 min read

Indian Government's Push for AI Development

Since the beginning of the year, the Indian government has been intensifying its efforts in artificial intelligence (AI). A significant part of this initiative is the allocation of nearly ₹10,000 crore under the IndiaAI mission, which includes subsidies for GPUs and incentives for developing indigenous AI models. The mission's aim is to foster the development of AI models that cater to India's diverse linguistic needs.

Challenges in Building Indic Language Models

Startups face a major challenge due to the lack of data in various Indian languages necessary for training AI models.
AI crawlers that collect data without permission are being blocked, complicating the data collection process.
Indian startups are struggling to match the quality of AI models from companies like OpenAI and Google (Gemini).
Building foundational AI models is costly, requiring billions in investment, a resource Indian startups lack.

Alternative Approaches to Data Collection

Soket Labs utilizes resources like the Common Crawl Foundation and explores translations, online content, and multimedia content in Indian languages for data.
Licensing content from publishing houses for languages such as Gujarati and Urdu is being pursued.
Gnani.ai crowdsources Indic language content and asks for voice donations to build its data library.
BharatGPT uses client data with permission for model training.

Focus on Specific Problems

Gnani.ai emphasizes solving specific issues, such as developing emoting voice AI bots, rather than building large commoditized language models.
Curating high-quality data is prioritized, with Soket Labs highlighting the need for 20 trillion tokens for effective training.

Companies Involved in IndiaAI Mission

Sarvam AI: 120 billion parameter open-source AI model to enhance governance and public service access.
Gan.ai: 70 billion parameter model focused on text-to-speech.
Soket Labs: 120 billion parameter open-source foundation model focusing on linguistic diversity in sectors like defense and healthcare.
Gnani.ai: 14 billion parameter voice AI model that is multilingual and processes speech in real-time.

Data Availability for Indic Language Models

AI4Bharat: 251 billion tokens of Indic language data across 22 languages with plans to collect 10 trillion tokens.
AIKosh: Provides datasets across sectors such as agriculture, arts, finance, and energy.
Bhashin Vaani Project: Led by IISc, ARTPARK, and Google, aiming to create a dataset of 150,000 hours of speech from 1 million people across 773 districts in India.

Other Datasets for Training AI Models

Common Crawl: Web crawlers generate 250 terabytes of data each month.
FineWeb-Edu: Provides 1.3 trillion tokens of very high-quality educational content and 5.4 trillion tokens of high-quality educational content.
The Stack-V2: Offers 900 billion tokens of coding data.
Cosmopedia: Provides 25 billion tokens of synthetic text.

Tags:

AI Development, Indic Language Models

Article Sources

https://economictimes.indiatimes.com/epaper/delhicapital/2025/sep/16/eye-on-ai/the-great-india-data-
