2026

REG-NER: A Benchmark Dataset for Bangla Regional Named Entity Recognition

Project Description

Named Entity Recognition (NER) in Bangla has seen considerable progress, yet existing resources remain limited to Standard Bangla and fail to capture the linguistic diversity of regional dialects. These dialects, including Barishal, Chittagong, Mymensingh, Noakhali, and Sylhet, exhibit distinctive phonological, lexical, and syntactic features that pose challenges for existing models. To bridge this gap, Project REG-NER (ANCHOLIK-NER) introduces the first benchmark dataset for dialect-aware NER in Bangla. The dataset consists of 17,405 sentences and 101,817 words, annotated with 10 entity tags covering people, organizations, places, relations, foods, animals, colors, roles, objects, and miscellaneous entities. Data was collected from regional texts, folklore, online content, and manually curated translations to ensure entity alignment across dialects. Extensive evaluation with transformer-based models (Bangla BERT, Bangla BERT Base, and Multilingual BERT) showed that Multilingual BERT performed best, reaching its highest F1-score of 82.61% on the Mymensingh dialect. Recognition remains hardest in the Chittagong dialect, which is attributed to its structural complexity. REG-NER thus represents a foundational contribution toward inclusive NLP for Bangla, offering the first structured pathway to develop dialect-aware NER systems.

Data Type Summary

Corpus Size: 17,405 sentences, 101,817 words
Entity Tags: 10 categories (Person, Organization, Location, Relation, Food, Animal, Color, Role, Object, Miscellaneous)
Regions Covered: Barishal, Chittagong, Mymensingh, Noakhali, Sylhet
Data Sources: Regional texts, folklore, news, social media, manually aligned translations
Annotation: Expert-validated, ensuring dialectal nuance and entity consistency
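
To make the tag inventory above concrete, the following minimal sketch builds a label set from the ten listed categories and checks that a tag sequence is well formed. The category names come from the summary; the BIO prefixes (B-, I-, O), the exact tag spellings, and the helper names build_bio_labels and is_valid_bio are assumptions for illustration and may not match the encoding used in the released files.

```python
# Illustrative sketch only. The ten categories are taken from the summary above;
# the BIO scheme and tag spellings are ASSUMPTIONS, not the dataset's documented format.
ENTITY_CATEGORIES = [
    "Person", "Organization", "Location", "Relation", "Food",
    "Animal", "Color", "Role", "Object", "Miscellaneous",
]


def build_bio_labels(categories):
    """Return ["O", "B-<cat>", "I-<cat>", ...] for the given categories."""
    labels = ["O"]
    for name in categories:
        labels.append(f"B-{name}")
        labels.append(f"I-{name}")
    return labels


LABELS = build_bio_labels(ENTITY_CATEGORIES)
LABEL_TO_ID = {label: index for index, label in enumerate(LABELS)}


def is_valid_bio(tags):
    """Check that every tag is known and every I- tag continues a span of the same category."""
    previous = "O"
    for tag in tags:
        if tag not in LABEL_TO_ID:
            return False
        if tag.startswith("I-"):
            # An I- tag must follow a B- or I- tag of the same category.
            if previous == "O" or previous[2:] != tag[2:]:
                return False
        previous = tag
    return True
```

A check of this kind can be run over each sentence before training so that malformed sequences are caught early; if the released files use a different scheme (for example plain category tags without prefixes), the label construction would need to change accordingly.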

Scope

REG-NER is designed as a benchmark resource for dialect-sensitive NLP research. Technologically, it supports tasks such as NER model evaluation, dialect adaptation of pre-trained transformers, and cross-lingual transfer learning. Linguistically, it documents underrepresented regional variations, preserving local identity in computational systems. The dataset can support practical applications in regional news analysis, healthcare communication, disaster response, e-governance, and localized digital services, where dialect plays a critical role. REG-NER also opens avenues for future expansion into more dialects, multimodal resources, and domain-specific entity tagging, broadening the inclusivity of Bangla language technologies.
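
As an illustration of the NER model evaluation task mentioned above, the sketch below computes exact-match, entity-level precision, recall, and F1, grouped by dialect. It is a simplified stand-in rather than the project's evaluation script: the source does not state how its F1-scores were computed, and the BIO tag format, the function names, and the (dialect, gold tags, predicted tags) input layout are all assumptions.

```python
# Illustrative sketch only: exact-match, span-level scoring over ASSUMED BIO tags.
# This is not the project's evaluation code.

def extract_spans(tags):
    """Turn a BIO tag sequence into a set of (start, end_exclusive, category) spans."""
    spans = set()
    start, category = None, None
    # A trailing "O" sentinel closes any span that runs to the end of the sentence.
    for index, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and category == tag[2:]
        if not continues:
            if category is not None:
                spans.add((start, index, category))
                start, category = None, None
            if tag.startswith("B-"):
                start, category = index, tag[2:]
            # Simplification: an I- tag with no matching open span is ignored.
    return spans


def entity_f1(gold_sentences, predicted_sentences):
    """Return (precision, recall, f1) over exact span-and-category matches."""
    tp = fp = fn = 0
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        gold_spans = extract_spans(gold)
        predicted_spans = extract_spans(predicted)
        tp += len(gold_spans & predicted_spans)
        fp += len(predicted_spans - gold_spans)
        fn += len(gold_spans - predicted_spans)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def f1_by_dialect(examples):
    """examples: iterable of (dialect, gold_tags, predicted_tags); returns {dialect: f1}."""
    grouped = {}
    for dialect, gold, predicted in examples:
        golds, predictions = grouped.setdefault(dialect, ([], []))
        golds.append(gold)
        predictions.append(predicted)
    return {d: entity_f1(g, p)[2] for d, (g, p) in grouped.items()}
```

Scoring each dialect separately, rather than pooling all sentences, is what makes per-dialect comparisons such as Mymensingh versus Chittagong possible. Established scoring libraries may handle malformed tag sequences differently from this sketch, so absolute numbers can differ slightly.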

Keywords

Bangla NLP, Regional Dialects, Named Entity Recognition, Benchmark Dataset, Transformer Models, Low-resource Languages

Research Domains

NLP for Low-resource and Dialectal Languages
Benchmarking and Dataset Development
Cross-lingual and Multilingual Information Extraction
Inclusive AI for Social Applications