Approach and Challenges of Training an Armenian Version of BERT Language Model
DOI:
https://doi.org/10.51408/1963-0121
Keywords:
BERT model, Armenian language, Low-resource language training, Transfer learning, Wikipedia dataset
Abstract
Training and deploying BERT models for specific languages, especially low-resource ones, presents a distinct set of challenges. These stem from the data scarcity inherent to languages like Armenian, the substantial computational resources that training BERT models requires, and the inefficiency of hosting and maintaining models for languages with limited digital traffic. In this research, we introduce a novel methodology that uses the Armenian Wikipedia as the primary data source, with the aim of optimizing the performance of BERT for the Armenian language. Our approach demonstrates that, with strategic preprocessing and transfer learning techniques, it is possible to achieve performance metrics that rival those of models trained on more abundant datasets. Furthermore, we explore fine-tuning pre-trained multilingual BERT models and find that they can serve as robust starting points for training models for low-resource but significant languages like Armenian.
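The abstract does not spell out the preprocessing steps, so the following is only a minimal sketch of what cleaning Armenian Wikipedia text before pre-training might look like. It assumes raw paragraphs containing common wiki markup, strips a few markup constructs, keeps paragraphs that are predominantly in Armenian script, and removes exact duplicates. The function names (`strip_wiki_markup`, `armenian_ratio`, `clean_corpus`) and the thresholds (`min_ratio`, `min_chars`) are hypothetical choices made for illustration, not values taken from the paper.

```python
import re

# Armenian Unicode block (U+0530..U+058F) plus Armenian ligatures (U+FB13..U+FB17).
_ARMENIAN = re.compile(r"[\u0530-\u058F\uFB13-\uFB17]")
# Any Unicode letter: word characters minus digits and underscore.
_LETTER = re.compile(r"[^\W\d_]")


def strip_wiki_markup(text: str) -> str:
    """Remove a few common wiki markup constructs (illustrative, not exhaustive)."""
    text = re.sub(r"<ref[^>]*/>", "", text)                           # self-closing refs
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)  # paired refs
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)                        # non-nested templates
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    text = re.sub(r"'{2,}", "", text)                                 # bold/italic quotes
    text = re.sub(r"<[^>]+>", "", text)                               # leftover HTML tags
    return re.sub(r"\s+", " ", text).strip()                          # collapse whitespace


def armenian_ratio(text: str) -> float:
    """Fraction of letters in `text` that belong to the Armenian script."""
    letters = _LETTER.findall(text)
    if not letters:
        return 0.0
    return sum(1 for ch in letters if _ARMENIAN.match(ch)) / len(letters)


def clean_corpus(paragraphs, min_ratio=0.8, min_chars=20):
    """Clean paragraphs, keep mostly-Armenian ones, and drop exact duplicates.

    `min_ratio` and `min_chars` are assumed thresholds, not values from the paper.
    """
    seen = set()
    kept = []
    for raw in paragraphs:
        text = strip_wiki_markup(raw)
        if len(text) < min_chars or armenian_ratio(text) < min_ratio:
            continue
        if text in seen:
            continue
        seen.add(text)
        kept.append(text)
    return kept
```

Text cleaned this way could then be passed to a tokenizer and used to continue pre-training a multilingual BERT checkpoint; that step depends on the training framework and is left out of this sketch.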
License
Copyright (c) 2024 Mikayel K. Gyurjyan and Andranik Hayrapetyan
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.