Journal
Preprint

ANCHOLIK-NER: A Benchmark Dataset for Bangla Regional Named Entity Recognition

Abstract

Named Entity Recognition (NER) in regional dialects is a critical yet underexplored area in Natural Language Processing (NLP), especially for low-resource languages like Bangla. While NER systems for Standard Bangla have made progress, no existing resources or models specifically address regional dialects such as Barishal, Chittagong, Mymensingh, Noakhali, and Sylhet, which exhibit unique linguistic features that existing models fail to handle effectively. To fill this gap, we introduce ANCHOLIK-NER, the first benchmark dataset for NER in Bangla regional dialects, comprising 17,405 sentences distributed across five regions. The dataset was sourced from publicly available resources and supplemented with manual translations, ensuring alignment of named entities across dialects. We evaluate three transformer-based models (Bangla BERT, Bangla BERT Base, and BERT Base Multilingual Cased) on this dataset. Our findings show that BERT Base Multilingual Cased performs best at recognizing named entities across regions, achieving an F1-score of 82.611% in Mymensingh. Despite strong overall performance, challenges remain in regions like Chittagong, where the models show lower precision and recall. Since no previous NER systems for Bangla regional dialects exist, our work represents a foundational step in addressing this gap. Future work will focus on improving model performance in underperforming regions and expanding the dataset to include more dialects, supporting the development of dialect-aware NER systems.

Keywords

Named Entity Recognition · Low Resource Language · Bangla Language · Regional Dialects · Natural Language Processing

BibTeX Citation

@article{paul2025ancholik,
title={ANCHOLIK-NER: A Benchmark Dataset for Bangla Regional Named Entity Recognition},
author={Paul, Bidyarthi and Preotee, Faika Fairuj and Sarker, Shuvashis and Refat, Shamim Rahim and Islam, Shifat and Muhammad, Tashreef and Hoque, Mohammad Ashraful and Manzoor, Shahriar},
journal={arXiv preprint arXiv:2502.11198},
year={2025}
}

Publication Details

Research Type

Research

Status

Preprint

Publication Year

2025

Publication Type

Journal

Research Domains

Natural Language Processing