AI models, datasets for 11 Indian languages available on internet free of cost

Published on: 22 Sept 2020 1:34 PM IST

Chennai: Faculty at the Indian Institute of Technology (IIT)-Madras, in association with AI4Bharat, have developed Artificial Intelligence (AI) models and datasets to process text in 11 Indian regional languages.

The researchers from IIT-Madras and AI4Bharat released AI models and datasets in Tamil, Hindi, Malayalam, Telugu, Kannada, Punjabi, Bengali, Odia, Assamese, Gujarati, and Marathi. The multi-lingual AI models and datasets developed through this initiative will provide the essential building blocks for students, faculty, start-ups and industry to work on Indian language tools and push the frontiers of technology.

The faculty have made these resources open source and free of cost, so that anyone can access them. The models can be downloaded from a GitHub repository (https://indicnlp.ai4bharat.org/).

“We have a very rich diversity of languages in our country. As we move towards a digital economy, it is important that our languages find a space online. This requires a lot of innovation in creating input tools, datasets, and AI models for Indian languages,” said Dr. Mitesh M. Khapra, assistant professor, Department of Computer Science and Engineering, IIT Madras.

He said: “Imagine a learner who posts a question on an e-learning platform in Tamil or Hindi or any other Indian regional language. There is a need for tools that can automatically process such questions written in Indian languages and classify them into specific topics.”

Over the past year, a team of researchers comprising students, faculty and volunteers from IIT-Madras and AI4Bharat worked on collecting data and training powerful models for processing texts written in Indian languages. The models take advantage of the similarities between Indian languages to make efficient use of data.

With these models, the researchers have been able to bring state-of-the-art technology to speakers of various Indian languages, for tasks such as document classification, sentiment analysis, semantic matching and paraphrase detection.

“While such tools are available for English and other foreign languages, there are hardly any tools for Indian languages and this is the critical area that we are trying to address through this initiative. These models are available free of cost as we want the entire country to benefit from them,” added Dr. Mitesh Khapra.
