Indigenous and tribal languages are an integral part of Bangladesh’s cultural identity, yet many of them are at risk of extinction because of limited documentation and declining intergenerational transmission. Project Matrivasha-Tribe addresses this challenge by digitizing five tribal languages of Bangladesh. A dedicated multilingual dataset was developed to capture these languages and align them with standard Bangla, creating a resource that supports both linguistic preservation and technological development. Data were collected from folklore, oral narratives, community literature, and cultural practices. Native speakers and local experts took part in annotation and validation, ensuring that idiomatic depth, cultural context, and symbolic expressions were retained. The project contributes to Natural Language Processing (NLP) for low-resource languages and also empowers tribal communities by fostering digital inclusion, educational opportunities, and socio-economic participation.
The scope of Matrivasha-Tribe covers both linguistic preservation and AI inclusivity. Technologically, it provides the first structured dataset for tribal languages of Bangladesh, enabling applications in machine translation, speech recognition, and cross-lingual information retrieval. Culturally, it helps safeguard endangered languages by digitizing oral traditions and literature that are at risk of disappearing. The dataset is designed as a foundational resource for researchers, educators, and policymakers who aim to strengthen indigenous knowledge systems, promote inclusive education, and develop technology that serves marginalized tribal communities.
Keywords: Natural Language Processing (NLP), Tribal Languages, Low-resource Languages, Multilingual Dataset, Digital Inclusion, Cultural Preservation