Bangladesh is home to a diverse range of regional languages, many of which remain underrepresented in digital resources and Natural Language Processing (NLP) research. This scarcity of data limits both technological inclusivity and cultural preservation. Project Matrivasha addresses this gap by creating the first multilingual dataset covering more than 25 regional languages of Bangladesh. A parallel corpus of more than 250,000 sentences was compiled, aligning each regional-language sentence with its standard Bangla equivalent. Source texts were drawn from folklore, literature, and social media and then underwent rigorous cleaning and annotation. Native speakers and linguistic experts were engaged to verify contextual accuracy and to preserve cultural nuance and idiomatic richness. Beyond its technical contributions, the project gives marginalized communities access to modern NLP tools, supporting education, communication, and economic participation, while offering researchers a foundation for low-resource language studies.
The scope of Matrivasha spans both cultural and technological dimensions. Technically, it provides the largest available dataset for regional Bangladeshi languages, enabling tasks such as machine translation, speech recognition, and sentiment analysis. Culturally, it helps preserve linguistic heritage by digitizing endangered dialects. The dataset is intended for researchers, policymakers, and developers seeking to design inclusive AI systems. It also supports educational technology, e-governance, and local content development, helping to bridge the digital divide for underrepresented linguistic groups.
Natural Language Processing (NLP), Low-resource Languages, Multilingual Dataset, Digital Inclusion, Cultural Preservation