The Great India Data Hunt | Current Affairs | Vision IAS

The Great India Data Hunt

Indian Government's Push for AI Development

Since the beginning of the year, the Indian government has been intensifying its efforts in artificial intelligence (AI). A significant part of this initiative is the allocation of nearly ₹10,000 crore under the IndiaAI mission, which includes subsidies for GPUs and incentives for developing indigenous AI models. The mission's aim is to foster the development of AI models that cater to India's diverse linguistic needs.

Challenges in Building Indic Language Models

  • Startups face a major challenge: many Indian languages lack the volume of data needed to train AI models.
  • AI crawlers that collect data without permission are being blocked, complicating the data collection process.
  • Indian startups are struggling to match the quality of AI models from companies like OpenAI and Google (maker of Gemini).
  • Building foundational AI models is costly, requiring billions in investment, a resource Indian startups lack.

Alternative Approaches to Data Collection

  • Soket Labs draws on resources such as the Common Crawl Foundation's web corpus and explores translations, online content, and multimedia content in Indian languages as data sources.
  • Licensing content from publishing houses for languages such as Gujarati and Urdu is being pursued.
  • Gnani.ai crowdsources Indic language content and asks for voice donations to build its data library.
  • BharatGPT uses client data with permission for model training.

Focus on Specific Problems

  • Gnani.ai emphasizes solving specific issues, such as developing emoting voice AI bots, rather than building large commoditized language models.
  • Curating high-quality data is prioritized, with Soket Labs highlighting the need for 20 trillion tokens for effective training.

Companies Involved in IndiaAI Mission

  • Sarvam AI: 120 billion parameter open-source AI model to enhance governance and public service access.
  • Gan.ai: 70 billion parameter model focused on text-to-speech.
  • Soket Labs: 120 billion parameter open-source foundation model focusing on linguistic diversity in sectors like defense and healthcare.
  • Gnani.ai: 14 billion parameter voice AI model that is multilingual and processes speech in real-time.

Data Availability for Indic Language Models

  • AI4Bharat: 251 billion tokens of Indic language data across 22 languages with plans to collect 10 trillion tokens.
  • AIKosh: Provides datasets across sectors such as agriculture, arts, finance, and energy.
  • Bhashini Vaani Project: Led by IISc, ARTPARK, and Google, aiming to create a dataset of 150,000 hours of speech from 1 million people across 773 districts in India.

Other Datasets for Training AI Models

  • Common Crawl: Web crawlers generate 250 terabytes of data each month.
  • FineWeb-Edu: Provides 1.3 trillion tokens of very high educational content and 5.4 trillion tokens of high educational content.
  • The Stack-V2: Offers 900 billion tokens of coding data.
  • Cosmopedia: Provides 25 billion tokens of synthetic text.