Image Caption Generation Model Based on Object Detector

Authors

  • Aghasi S. Poghosyan, Institute for Informatics and Automation Problems of NAS RA
  • Hakob G. Sarukhanyan, Institute for Informatics and Automation Problems of NAS RA

DOI:

https://doi.org/10.51408/1963-0016

Keywords:

Neural networks, Image caption, Object detection, Deep learning, RNN, LSTM

Abstract

Automated extraction of semantic information from an image is a difficult task. Existing works can extract either an image caption or object names and their coordinates. This work presents object detection and automated caption generation implemented in a single model. We have built an image caption generation model on top of an object detection model, adding extra layers to the object detector to improve caption generator performance. The resulting single model can detect objects, localize them, and generate an image caption in natural language.

Published

2021-12-10

How to Cite

Poghosyan, A. S., & Sarukhanyan, H. G. (2021). Image Caption Generation Model Based on Object Detector. Mathematical Problems of Computer Science, 50, 5–14. https://doi.org/10.51408/1963-0016