2024

Matrivasha-Tribe: A Multilingual Corpus for Indigenous Languages of Bangladesh

Project Description

Indigenous and tribal languages are an integral part of Bangladesh’s cultural identity, yet many are at risk of extinction because of limited documentation and declining intergenerational transmission. Project Matrivasha-Tribe addresses this challenge by digitizing five tribal languages of Bangladesh. The project developed a multilingual dataset that aligns these languages with standard Bangla, creating a resource for both linguistic preservation and technological development. Data were collected from folklore, oral narratives, community literature, and cultural practices. Native speakers and local experts took part in annotation and validation so that idiomatic depth, cultural context, and symbolic expressions were retained. The project contributes to Natural Language Processing (NLP) for low-resource languages and aims to support tribal communities through digital inclusion, educational opportunities, and socio-economic participation.

Data Type Summary

  • Parallel Corpus: 80,000+ aligned sentence pairs (tribal language → Bangla)
  • Languages Covered: 5 tribal languages of Bangladesh (e.g., Chakma, Marma, Tripura, Santali, Garo)
  • Sources: Folklore, oral storytelling, traditional songs, local literature, and community records
  • Annotation: Community-led validation with linguistic experts to preserve idiomatic and cultural authenticity
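
To make the summary above concrete, the following is a minimal sketch of how one aligned pair might be represented and loaded in Python. The tab-separated layout, the column names (language, source_text, bangla_text, source_type), and the helper functions are assumptions made for illustration; the description does not specify the dataset's actual file format or schema. The sample rows use placeholder strings rather than real corpus text.

```python
import csv
import io
from collections import Counter
from dataclasses import dataclass

# Hypothetical column names; the real Matrivasha-Tribe schema may differ.
FIELDS = ("language", "source_text", "bangla_text", "source_type")


@dataclass(frozen=True)
class ParallelPair:
    language: str     # tribal language name, e.g. "Chakma"
    source_text: str  # sentence in the tribal language
    bangla_text: str  # aligned sentence in standard Bangla
    source_type: str  # origin of the text, e.g. "folklore"


def read_pairs(handle):
    """Parse tab-separated rows into ParallelPair records.

    Rows with any empty field are skipped, since a pair without both
    sides cannot be used as aligned data.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    pairs = []
    for row in reader:
        values = [(row.get(name) or "").strip() for name in FIELDS]
        if all(values):
            pairs.append(ParallelPair(*values))
    return pairs


def pairs_per_language(pairs):
    """Count aligned pairs for each tribal language."""
    return Counter(pair.language for pair in pairs)


# Placeholder rows only; these are not real corpus sentences.
SAMPLE = (
    "language\tsource_text\tbangla_text\tsource_type\n"
    "Chakma\t<chakma sentence 1>\t<bangla sentence 1>\tfolklore\n"
    "Chakma\t<chakma sentence 2>\t<bangla sentence 2>\tsong\n"
    "Garo\t<garo sentence 1>\t\tfolklore\n"  # Bangla side missing, so skipped
    "Marma\t<marma sentence 1>\t<bangla sentence 3>\toral narrative\n"
)

pairs = read_pairs(io.StringIO(SAMPLE))
counts = pairs_per_language(pairs)
print(counts)
```

Keeping the language name and source type on every record is one possible design choice: it would let users filter or balance the corpus by language or genre before training a translation model.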

Scope

The scope of Matrivasha-Tribe extends to both linguistic preservation and AI inclusivity. Technologically, it provides the first structured dataset for tribal languages of Bangladesh, enabling applications in machine translation, speech recognition, and cross-lingual information retrieval. Culturally, it safeguards endangered languages by digitizing oral traditions and literature that are at risk of disappearing. This dataset is designed to be a foundational resource for researchers, educators, and policymakers aiming to strengthen indigenous knowledge systems, promote inclusive education, and develop technology that serves marginalized tribal communities.

Keywords

Natural Language Processing (NLP), Tribal Languages, Low-resource Languages, Multilingual Dataset, Digital Inclusion, Cultural Preservation

Research Domains

  • Applied NLP in Indigenous and Tribal Languages
  • Cross-lingual and Multilingual Learning
  • Cultural Preservation through Digitization
  • AI for Marginalized and Low-resource Communities