Image Caption Generation and Object Detection via a Single Model

Authors

  • Aghasi S. Poghosyan, Institute for Informatics and Automation Problems of NAS RA

Keywords

Neural networks, Image caption, Object detection, Deep learning, RNN, LSTM

Abstract

Automated extraction of semantic information from images is a difficult task. Existing works can extract either an image caption or object names together with their coordinates. This work presents a single model that merges an object detection system with an automated caption generation system. From an image, the final model extracts a caption as well as object names and coordinates, without losing accuracy relative to the initial models.

Published

2021-12-10

How to Cite

Poghosyan, A. S. (2021). Image Caption Generation and Object Detection via a Single Model. Mathematical Problems of Computer Science, 48, 42–49. Retrieved from http://mpcs.sci.am/index.php/mpcs/article/view/119