National Institute of Technology Karnataka, Surathkal


Automatically describing the content of images in natural language is a fundamental and challenging task with great potential impact. For example, it could help visually impaired people better understand the content of images on the web, and it could provide more accurate and compact descriptions of images and videos in scenarios such as image sharing on social networks or video surveillance systems. This project accomplishes the task using deep neural networks: by learning from image-caption pairs, the method generates captions that are usually semantically descriptive and grammatically correct.

This application bridges vision and natural language. If we can do well on this task, we can then use natural language processing technologies to understand the world depicted in images. In addition, we introduced an attention mechanism, which can recognize what a word refers to in the image and thus summarize the relationships between objects in it. This is a powerful tool for exploiting the massive amount of unstructured image data, which makes up the bulk of the world's data.


In this project, we develop a framework that leverages artificial neural networks to caption an image based on its significant features. Recurrent Neural Networks (RNNs) are increasingly used as encoder-decoder frameworks for machine translation. Our objective is to replace the encoder of the RNN with a Convolutional Neural Network (CNN), which transforms the image into input suitable for the RNN decoder. Each image is converted into a feature vector characterizing its distinctive content. The analysis is carried out on the popular Flickr 8K dataset.


Convolutional Neural Networks, InceptionV3 Model, Deep Learning, Keras, Natural Language Processing, Google Colab, Python, TensorFlow, Flickr 8K dataset (training and testing)


1. Collect the dataset and clean the caption data using Python.
2. Pre-process the images using the InceptionV3 model and convert each one to a feature vector.
3. Pre-process the captions by tokenizing each unique word in the training set, thereby creating a vocabulary of words.
4. Use a pretrained model with its weights to detect objects in the images.
5. Train the model using a CNN with the above pre-processed images and captions.
6. Generate grammatically correct captions word by word using Natural Language Processing techniques.
7. Save the trained weights in a pickle file to predict captions in the future.
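Step 2 relies on InceptionV3's input convention: a 299×299 RGB image with pixel values scaled from [0, 255] to [-1, 1]. A minimal NumPy sketch of that scaling, mirroring what `keras.applications.inception_v3.preprocess_input` does (the resizing step is omitted here):

```python
import numpy as np

def preprocess_for_inception(image):
    """Scale uint8 pixel values in [0, 255] to the [-1, 1] range
    expected by InceptionV3."""
    return image.astype(np.float32) / 127.5 - 1.0

# A black image maps to -1.0 everywhere, a white image to 1.0.
black = np.zeros((299, 299, 3), dtype=np.uint8)
white = np.full((299, 299, 3), 255, dtype=np.uint8)
```

After preprocessing, the image is passed through InceptionV3 (with the final classification layer removed) to obtain a 2048-dimensional feature vector per image.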
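Step 3 (building a vocabulary from the captions) can be sketched as follows. This is a minimal, self-contained stand-in for Keras' `Tokenizer`, assuming lowercase normalization, punctuation stripping, and whitespace tokenization; the project's actual cleaning rules may differ:

```python
import string

def build_vocabulary(captions):
    """Clean captions and map each unique word to an integer index.

    Index 0 is reserved for padding, as in Keras' Tokenizer.
    """
    vocab = {}
    cleaned = []
    for caption in captions:
        # Lowercase, strip punctuation, then split on whitespace.
        words = caption.lower().translate(
            str.maketrans("", "", string.punctuation)).split()
        cleaned.append(words)
        for word in words:
            if word not in vocab:
                vocab[word] = len(vocab) + 1  # 0 reserved for padding
    return vocab, cleaned

def encode(words, vocab):
    """Turn a cleaned caption into a sequence of integer indices."""
    return [vocab[w] for w in words if w in vocab]

captions = ["A dog runs on the grass.", "A child plays with a dog."]
vocab, cleaned = build_vocabulary(captions)
```

The encoded integer sequences (padded to a fixed length) are what the decoder is trained on.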
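The word-by-word generation in step 6 is typically implemented as a greedy decoding loop: starting from a start token, the model repeatedly predicts the most likely next word until it emits an end token or reaches a maximum length. A minimal sketch, where the hypothetical `next_word` callable stands in for the trained CNN-RNN model's argmax prediction:

```python
def greedy_caption(next_word, max_len=20):
    """Generate a caption word by word using greedy decoding.

    `next_word` maps the partial caption (a list of words) to the most
    probable next word; here it stands in for the trained model.
    """
    caption = ["startseq"]
    for _ in range(max_len):
        word = next_word(caption)
        if word == "endseq":
            break
        caption.append(word)
    return " ".join(caption[1:])  # drop the start token

# Toy predictor: walks through a fixed sentence, then stops.
script = iter(["a", "dog", "runs", "endseq"])
print(greedy_caption(lambda partial: next(script)))  # prints "a dog runs"
```

In the real model, `next_word` would feed the image feature vector and the encoded partial caption through the network and take the argmax over the vocabulary.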


A few of our results are shown below.



1. Using a better and larger dataset
2. Performing hyperparameter tuning
3. Trying different model architectures and choosing the best, if any
4. Using a better loss function


Convolutional Neural Networks, InceptionV3 model, Natural Language Processing


We can see that the generated sentences describe the pictures quite well. The main objects in the images are recognized and mentioned in the sentences, and many of the minor details are also captured. There are some mistakes in certain images (fig 3), where, for example, the dog is predicted as running. However, humans can easily make such mistakes too, since similar-looking objects do exist in the images. The generated sentences also follow grammar well.


The following papers were used as references for our project:



● Akshit Patel(Mentor)
● Niwedita(Mentor)
● Kshitij Raj
● Shreesha Bharadwaj