Compact N-gram Language Models for Armenian

Authors

  • Davit S. Karamyan, Russian-Armenian University
  • Tigran S. Karamyan, Yerevan State University

DOI:

https://doi.org/10.51408/1963-0084

Keywords:

Armenian language, N-gram Language Model, Subword Language Model, Pruning, Quantization

Abstract

Applications such as speech recognition and machine translation use language models to select the most likely hypothesis among many candidates. For on-device applications, inference time and model size are just as important as accuracy. In this work, we explored N-gram models, the fastest family of language models, for the Armenian language. In addition, we studied the impact of pruning and quantization methods on model size. Finally, we used Byte Pair Encoding to build a subword language model. As a result, we obtained a compact (100 MB) subword language model trained on massive Armenian corpora.
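
The abstract outlines a pipeline of subword tokenization with Byte Pair Encoding, N-gram estimation, pruning, and quantization. The sketch below illustrates what such a pipeline can look like with standard tooling; it is not the authors' exact setup. It assumes the KenLM toolkit (lmplz, build_binary) and the SentencePiece library are installed, and the file names, 5-gram order, 16k BPE vocabulary, pruning thresholds, and 8-bit quantization widths are hypothetical choices, not values taken from the paper.

```python
# Minimal sketch of a compact subword N-gram LM pipeline.
# Assumptions: KenLM's lmplz/build_binary are on PATH, SentencePiece is
# installed, and "armenian_corpus.txt" (one sentence per line) is hypothetical.
import subprocess
import sentencepiece as spm

CORPUS = "armenian_corpus.txt"

# 1. Train a Byte Pair Encoding subword model and re-tokenize the corpus.
spm.SentencePieceTrainer.train(
    input=CORPUS, model_prefix="hy_bpe", vocab_size=16000, model_type="bpe"
)
sp = spm.SentencePieceProcessor(model_file="hy_bpe.model")
with open(CORPUS, encoding="utf-8") as src, \
     open("corpus.bpe.txt", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(" ".join(sp.encode(line.strip(), out_type=str)) + "\n")

# 2. Estimate a pruned 5-gram model: n-grams of order >= 2 with count <= 1
#    are dropped, which shrinks the ARPA file substantially.
with open("corpus.bpe.txt") as fin, open("hy_5gram.arpa", "w") as fout:
    subprocess.run(
        ["lmplz", "-o", "5", "--prune", "0", "1", "1", "1", "1"],
        stdin=fin, stdout=fout, check=True,
    )

# 3. Quantize probabilities and backoffs to 8 bits and pack the model into
#    KenLM's binary trie format for fast, memory-mapped queries.
subprocess.run(
    ["build_binary", "-q", "8", "-b", "8", "trie",
     "hy_5gram.arpa", "hy_5gram.bin"],
    check=True,
)
```

The pruning thresholds and quantization bit widths trade model size against perplexity; the specific settings behind the 100 MB model reported in the abstract are described in the paper itself and are not reproduced here.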

Published

2022-06-01

How to Cite

Karamyan, D. S., & Karamyan, T. S. (2022). Compact N-gram Language Models for Armenian. Mathematical Problems of Computer Science, 57, 30–38. https://doi.org/10.51408/1963-0084