Approach and Challenges of Training an Armenian Version of BERT Language Model

Authors

  • Mikayel K. Gyurjyan, Institute for Informatics and Automation Problems of NAS RA
  • Andranik Hayrapetyan, Institute for Informatics and Automation Problems of NAS RA

Keywords:

BERT model, Armenian language, Low-resource language training, Transfer learning, Wikipedia dataset

Abstract

Training and deploying BERT models for specific languages, especially low-resource ones, presents a unique set of challenges. These challenges stem from the data scarcity inherent to languages like Armenian, the extensive computational resources required to train BERT models, and the inefficiency of hosting and maintaining models for languages with limited digital traffic. In this research, we introduce a novel methodology that leverages the Armenian Wikipedia as a primary data source, aiming to optimize the performance of BERT for the Armenian language. Our approach demonstrates that, with strategic preprocessing and transfer learning techniques, it is possible to achieve performance metrics that rival those of models trained on more abundant datasets. Furthermore, we explore the potential of fine-tuning pre-trained multilingual BERT models, showing that they can serve as robust starting points for training models for low-resource yet significant languages such as Armenian.
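
As a rough illustration of the transfer-learning setup described in the abstract, the sketch below continues masked-language-model pretraining of a multilingual BERT checkpoint on Armenian ("hy") Wikipedia text using the Hugging Face transformers and datasets libraries. The dataset snapshot name, hyperparameters, and preprocessing are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch: continued MLM pretraining of multilingual BERT on Armenian
# Wikipedia. Dataset snapshot, hyperparameters, and preprocessing are assumed
# for illustration and do not reproduce the paper's exact setup.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Armenian Wikipedia dump from the Hugging Face hub (assumed snapshot name).
wiki = load_dataset("wikimedia/wikipedia", "20231101.hy", split="train")

# Multilingual BERT as the starting checkpoint for transfer learning.
checkpoint = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

def tokenize(batch):
    # Truncate articles to BERT's 512-token limit; a fuller pipeline would also
    # clean markup and split long articles into multiple segments.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = wiki.map(tokenize, batched=True, remove_columns=wiki.column_names)

# Standard 15% random token masking for the masked-language-model objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert-armenian-wiki",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=5e-5,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```

The resulting checkpoint can then be fine-tuned on downstream Armenian tasks in the usual way; the point of the sketch is only that a pre-trained multilingual model, rather than a randomly initialized one, serves as the starting point.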

Published

2024-12-01

How to Cite

Gyurjyan, M. K., & Hayrapetyan, A. (2024). Approach and Challenges of Training an Armenian Version of BERT Language Model. Mathematical Problems of Computer Science, 62, 59–71. Retrieved from http://mpcs.sci.am/index.php/mpcs/article/view/861