2024

Matrivasha: An Exploratory Effort Towards Building Multilingual Resources for Underrepresented Languages in Bangladesh

Project Description

Bangladesh is home to a diverse range of regional languages, many of which remain underrepresented in digital resources and Natural Language Processing (NLP) research. This absence of data limits both technological inclusivity and cultural preservation. Project Matrivasha directly addresses this gap by creating the first multilingual dataset for over 25 regional languages of Bangladesh. A parallel corpus of more than 250,000 sentences was compiled, aligning each regional-language sentence with its standard Bangla equivalent. The data were drawn from folklore, literature, oral history, and social media, then rigorously cleaned and annotated. Native speakers and linguistic experts were engaged to ensure contextual accuracy and to preserve cultural nuance and idiomatic richness. Beyond its technical contributions, the project gives marginalized communities access to modern NLP tools that can improve education, communication, and economic participation, while offering researchers a unique foundation for low-resource language studies.

Data Type Summary

  • Parallel Corpus: 250,000+ aligned sentences (regional → Bangla)
  • Languages Covered: 25+ underrepresented Bangladeshi regional languages
  • Sources: Folklore texts, local literature, oral history, and social media content
  • Annotation: Expert-validated translations with cultural and idiomatic preservation
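
As an illustration of how one aligned entry in such a parallel corpus might be represented, the following is a minimal sketch only. The record layout, the field names (language, regional_text, bangla_text, source, validated), and the helper pairs_per_language are hypothetical and are not taken from the released dataset; placeholder strings stand in for real sentences.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePair:
    """One aligned entry: a regional-language sentence and its Bangla equivalent.

    All field names are assumptions made for illustration.
    """
    language: str        # hypothetical label for the regional language
    regional_text: str   # sentence in the regional language
    bangla_text: str     # aligned standard Bangla sentence
    source: str          # e.g. "folklore", "literature", "oral_history", "social_media"
    validated: bool      # True once a native speaker or expert has reviewed the pair


def pairs_per_language(pairs):
    """Count expert-validated pairs for each regional language."""
    return Counter(p.language for p in pairs if p.validated)


# Placeholder records; no actual corpus data is shown here.
sample = [
    SentencePair("language_a", "<regional sentence 1>", "<Bangla sentence 1>", "folklore", True),
    SentencePair("language_a", "<regional sentence 2>", "<Bangla sentence 2>", "social_media", False),
    SentencePair("language_b", "<regional sentence 3>", "<Bangla sentence 3>", "literature", True),
]

counts = pairs_per_language(sample)
```

Keeping the source and validation status on every pair would let a user filter the corpus, for example to train only on expert-validated sentences or to compare coverage across languages.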

Scope

The scope of Matrivasha extends across both cultural and technological dimensions. Technically, it provides the largest available dataset for regional Bangladeshi languages, enabling tasks such as machine translation, speech recognition, and sentiment analysis. Culturally, it contributes to the preservation of linguistic heritage by digitizing endangered languages and dialects. The dataset is intended for researchers, policymakers, and developers seeking to design inclusive AI systems. It also supports educational technology, e-governance, and local content development, helping to bridge the digital divide for underrepresented linguistic groups.

Keywords

Natural Language Processing (NLP), Low-resource Languages, Multilingual Dataset, Digital Inclusion, Cultural Preservation

Research Domains

  • Applied NLP in Low-resource Settings
  • Multilingual and Cross-lingual Learning
  • Digital Humanities and Linguistic Preservation
  • AI for Social Good