National Institute of Technology Karnataka, Surathkal ie@nitk.edu.in

PROBLEM STATEMENT:

Automatically describing the content of images in natural language is a fundamental and challenging task with great potential impact. For example, it could help visually impaired people better understand the content of images on the web, and it could provide more accurate and compact descriptions of images and videos in scenarios such as image sharing on social networks or video surveillance systems. This project accomplishes the task using deep neural networks: by learning from image-caption pairs, the method generates image captions that are usually semantically descriptive and grammatically correct.

This application bridges vision and natural language. If we can do well on this task, we can then use natural language processing technologies to understand the world through images. In addition, we introduce an attention mechanism that can recognize what a word refers to in the image and thus summarize the relationships between objects in it. This makes the approach a powerful tool for exploiting the massive amount of unstructured image data, which makes up the bulk of the world's data.

PROPOSED SOLUTION:

In this project, we develop a framework that leverages artificial neural networks to caption an image based on its significant features. Recurrent Neural Networks (RNNs) are increasingly used as encoder-decoder frameworks for machine translation. Our objective is to replace the encoder of such a framework with a Convolutional Neural Network (CNN), transforming an image into a feature representation that is fed into the RNN decoder. Each image is converted into a feature vector characterizing its distinctive content. The analysis is carried out on the popular Flickr 8K dataset.
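The encoder-decoder idea above can be sketched as a small Keras model. This is a minimal, illustrative sketch, not the project's exact implementation: it assumes 2048-dimensional InceptionV3 image features, and the values of vocab_size, max_length and embedding_dim are placeholders to be set from the actual Flickr 8K vocabulary and captions.

```python
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 5000    # illustrative; taken from the tokenized caption vocabulary
max_length = 34      # illustrative; length of the longest training caption
embedding_dim = 256

# Image branch: a 2048-dim InceptionV3 feature vector projected to 256 dims.
image_input = Input(shape=(2048,))
img = Dropout(0.5)(image_input)
img = Dense(256, activation='relu')(img)

# Caption branch: the partial caption so far, encoded by an LSTM.
caption_input = Input(shape=(max_length,))
cap = Embedding(vocab_size, embedding_dim, mask_zero=True)(caption_input)
cap = Dropout(0.5)(cap)
cap = LSTM(256)(cap)

# Merge both branches and predict the next word of the caption.
decoder = add([img, cap])
decoder = Dense(256, activation='relu')(decoder)
output = Dense(vocab_size, activation='softmax')(decoder)

model = Model(inputs=[image_input, caption_input], outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam')
```

At inference time, the model is called repeatedly: the caption generated so far is fed back in, and the word with the highest softmax probability is appended until an end-of-sequence token is produced.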

TECHNOLOGY USED

Convolutional Neural Networks, InceptionV3 model, Deep Learning, Keras, Natural Language Processing, Google Colab, Python, TensorFlow, Flickr 8K dataset (training and testing)

METHODOLOGY

1. Collect the dataset and clean the caption data using Python.
2. Pre-process the images using the InceptionV3 model, converting each one into a feature vector.
3. Pre-process the captions by tokenizing each unique word in the training set, thereby creating a vocabulary of words.
4. Use a pretrained model with its weights to detect objects in the images.
5. Train the model using a CNN on the above pre-processed images and captions.
6. Generate grammatically correct captions word by word using natural language processing techniques.
7. Save the trained weights in a pickle file to predict captions in the future.
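Steps 2, 3 and 7 above can be sketched as follows. This is an illustrative sketch, not the project's exact code: the captions list, the `encode_image` helper and the file names are placeholders, and the InceptionV3 encoder is built here with `weights=None` to avoid a download (in practice `weights='imagenet'` would be used to get the pretrained features).

```python
import pickle
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.preprocessing.text import Tokenizer

# Step 2: drop InceptionV3's classification head and keep the pooled
# 2048-dim activations as the image encoding.
# Use weights='imagenet' in practice; None here only avoids the download.
encoder = InceptionV3(weights=None, include_top=False, pooling='avg')

def encode_image(path):
    """Load an image and return its 2048-dim InceptionV3 feature vector."""
    img = img_to_array(load_img(path, target_size=(299, 299)))
    img = preprocess_input(np.expand_dims(img, axis=0))
    return encoder.predict(img, verbose=0).reshape(2048)

# Step 3: build a word vocabulary from the training captions
# (two illustrative captions stand in for the Flickr 8K training set).
captions = ['startseq a dog runs on grass endseq',
            'startseq a child plays in a park endseq']
tokenizer = Tokenizer()
tokenizer.fit_on_texts(captions)
vocab_size = len(tokenizer.word_index) + 1  # +1 for the padding index

# Step 7: persist the tokenizer (and, analogously, model weights)
# in a pickle file for later caption prediction.
with open('tokenizer.pkl', 'wb') as f:
    pickle.dump(tokenizer, f)
```

Each image then maps to one feature vector, and each caption to a sequence of integer word indices, which together form the training pairs for the model.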

RESULTS

A few of our results are shown below.

DEMO

FUTURE WORK

1. Using a better and larger dataset
2. Performing hyperparameter tuning
3. Trying different model architectures and choosing the best one
4. Using a better loss function

KEY LEARNINGS

Convolutional Neural Networks, InceptionV3 model, Natural Language Processing

CONCLUSION

We can see that the generated sentences describe the pictures quite well: the main parts of the images are recognized and expressed in the sentences, and many of the minor parts are encoded as well. There are some mistakes in certain images (fig 3), where the dog is predicted as running; however, humans can easily make the same mistake, since similar objects do exist in the image. The generated sentences also follow grammar well.

REFERENCES

Following are the resources we used as references for our project:

1. https://towardsdatascience.com/image-captioning-with-keras-teaching-computers-to-describe-pictures-c88a46a311b8

TEAM

● Akshit Patel (Mentor) - akshitpatel01@gmail.com
● Niwedita (Mentor) - niwedita.dakshana2017@gmail.com
● Kshitij Raj - kshiteej.raj@gmail.com
● Shreesha Bharadwaj - bharadwajshreesha@gmail.com