Review on Textual Description of Image Contents

Vasundhara Kadam; Ramesh M. Kagalkar

doi:10.14445/22312803/IJCTT-V30P137

Research Article | Open Access

Volume 30 | Number 1 | Year 2015 | Article Id. IJCTT-V30P137 | DOI : https://doi.org/10.14445/22312803/IJCTT-V30P137

Citation :

Vasundhara Kadam, Ramesh M. Kagalkar, "Review on Textual Description of Image Contents," International Journal of Computer Trends and Technology (IJCTT), vol. 30, no. 1, pp. 213-217, 2015. Crossref, https://doi.org/10.14445/22312803/IJCTT-V30P137

Abstract

Relating visual imagery to visually descriptive language is a major challenge for computer vision, and it is becoming more relevant as recognition and detection techniques begin to work. This paper reviews techniques used for image description, such as modeling the associations between objects present in an image. Additionally, the paper briefly presents an approach to automatically generating natural language descriptions from images. The proposed system consists of two parts: content planning and surface realization. The first part, content planning, smooths the output of computer vision-based recognition and detection algorithms with statistics extracted from large collections of visually descriptive text to determine the best content words to use to describe an image. The second part, surface realization, selects words to build natural language sentences based on the predicted content and general statistics from natural language.
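
The two-stage pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the detector scores, the text co-occurrence prior, and the function names `plan_content` and `realize` are all hypothetical. Content planning is modeled here as rescoring noisy detector outputs with text statistics, and surface realization is reduced to a fixed template, a simplification of the statistics-driven word selection the abstract describes.

```python
from itertools import product

# Hypothetical detector confidences for one image (illustrative values only).
# On vision scores alone, the best pair would be ("wooden", "dog").
detections = {
    "attribute": {"wooden": 0.55, "furry": 0.45},
    "object": {"dog": 0.70, "cat": 0.30},
}

# Hypothetical statistics from descriptive text: how plausible each
# (attribute, object) pairing is in natural language.
text_prior = {
    ("furry", "dog"): 0.50,
    ("furry", "cat"): 0.40,
    ("wooden", "dog"): 0.05,
    ("wooden", "cat"): 0.05,
}


def plan_content(detections, text_prior):
    """Content planning: choose the (attribute, object) pair that maximizes
    the product of the vision scores and the text-derived prior."""
    best_pair, best_score = None, float("-inf")
    for attr, obj in product(detections["attribute"], detections["object"]):
        score = (
            detections["attribute"][attr]
            * detections["object"][obj]
            * text_prior.get((attr, obj), 1e-6)  # small floor for unseen pairs
        )
        if score > best_score:
            best_pair, best_score = (attr, obj), score
    return best_pair


def realize(content):
    """Surface realization (simplified): fill a fixed sentence template."""
    attr, obj = content
    article = "an" if attr[0] in "aeiou" else "a"
    return f"This is a picture of {article} {attr} {obj}."


content = plan_content(detections, text_prior)
print(realize(content))
```

In this toy setting the text prior overrides the implausible "wooden dog" pairing that the raw detector scores favor, which is the smoothing effect that content planning is meant to provide.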

Keywords

Computer vision, image description generation, content planning, surface realization.
